Methane Benchmark Dataset (PINEAPPLE + Clean)
This folder contains the Methane Benchmark Dataset in two variants:
- balanced: a balanced mix of methane and non-methane patches
- clean: no-methane only (negative patches)
The dataset combines multiple modalities (HSI and RGB), simulated Sentinel-2 BOA reflectance (S2 BOA refl) derived from HSI, TerraMind TiM-generated products (including S2L2A and LULC), text captions, and labels produced by different sources (LLM, human, and TiM/TerraMind). The clean split additionally contains Intuition-1 simulated data.
1. Dataset overview
1.1 balanced (PINEAPPLE: methane + non-methane)
- 178 patches, 27 flights
- HSI: AVIRIS-NG
- RGB: RGB renderings / visualizations aligned with the patches
- Simulated Sentinel-2 (BOA reflectance): derived from HSI and stored under simulated_s2_boarefl_balanced/
- TerraMind TiM products (derived from simulated S2 BOA reflectance; stored under tim_generation_balanced/):
  - S2L2A (TiM-generated)
  - LULC (TiM-generated, pixel-level)
  - Plots and auxiliary outputs
- Annotations
  - Urban vs. non-urban (image-level): LLM
  - Urban vs. non-urban (image-level): human
  - Textual description: LLM
1.2 clean (no-methane only)
- 261 patches (neighboring patches; center patch excluded), 20 flights
- HSI: AVIRIS-NG
- RGB: RGB renderings / visualizations aligned with the patches
- Simulated Sentinel-2 (BOA reflectance): derived from HSI and stored under simulated_s2_boareflclean/ (folder name preserved as exported)
- TerraMind TiM products (derived from simulated S2 BOA reflectance; stored under tim_generation_clean/):
  - S2L2A (TiM-generated)
  - LULC (TiM-generated, pixel-level)
  - Plots and auxiliary outputs
- Intuition-1 simulated data (clean only): additional simulated modality for extended ablations and robustness checks (see notes in Section 2)
- Annotations
  - Urban vs. non-urban (image-level): LLM
  - Urban vs. non-urban (image-level): human
  - Textual description: LLM
2. Folder structure
Top-level directories:
aviris_hsi_balanced/
AVIRIS-NG hyperspectral patches for the balanced split.
aviris_hsi_clean/
AVIRIS-NG hyperspectral patches for the clean (no-methane) split.
rgb_balanced/
RGB images for the balanced split (aligned to patches).
rgb_clean/
RGB images for the clean split (aligned to patches).
captions_balanced/
LLM-generated text captions/descriptions for the balanced split.
captions_clean/
LLM-generated text captions/descriptions for the clean split.
simulated_s2_boarefl_balanced/
Simulated Sentinel-2 BOA reflectance images for the balanced split (simulated from HSI).
simulated_s2_boareflclean/
Simulated Sentinel-2 BOA reflectance images for the clean split (simulated from HSI; folder name preserved as exported).
tim_generation_balanced/
TerraMind TiM outputs generated from simulated S2 BOA reflectance (balanced split).
Contains (at least): s2l2a/, lulc/, classes/, plots/, and auxiliary files (e.g., a legend script).
tim_generation_clean/
TerraMind TiM outputs generated from simulated S2 BOA reflectance (clean split).
Contains the same product types as the balanced split.
I1_simulation
Additional Intuition-1 simulated data aligned with clean split patches.
Other files:
truth_false_labels.xlsx
A compact label file (yes/no style) aggregating selected annotations (LLM, human, TiM classes), depending on your export.
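The directory layout above can be verified before any loading code runs. The following is a minimal sketch, not part of the dataset tooling: the names `EXPECTED_DIRS` and `missing_dirs` are hypothetical, and the directory names are copied from this README (including the irregular `simulated_s2_boareflclean` name).

```python
from pathlib import Path

# Top-level directories named in this README, grouped by split.
# The clean S2 folder has no underscore before "clean" (name preserved as exported).
EXPECTED_DIRS = {
    "balanced": [
        "aviris_hsi_balanced",
        "rgb_balanced",
        "captions_balanced",
        "simulated_s2_boarefl_balanced",
        "tim_generation_balanced",
    ],
    "clean": [
        "aviris_hsi_clean",
        "rgb_clean",
        "captions_clean",
        "simulated_s2_boareflclean",
        "tim_generation_clean",
        "I1_simulation",
    ],
}


def missing_dirs(root, split):
    """Return the expected directories for `split` that are absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS[split] if not (root / name).is_dir()]
```

An empty list from `missing_dirs(root, "clean")` means every directory listed for the clean split is present under `root`; it says nothing about the files inside them.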
3. Labels and annotation sources
The dataset provides yes/no labels and/or categorical classes from the following sources:
3.1 LLM labels (image-level)
- Urban vs. non-urban classification at image/patch level
- Stored in the exported label file and/or per-sample metadata (depending on your pipeline)
3.2 Human labels (image-level)
- Urban vs. non-urban classification at image/patch level
- Available for at least the clean split (and optionally balanced, depending on the export)
3.3 TerraMind TiM products (pixel-level and per-image products)
- S2L2A generated by TerraMind TiM from simulated S2 BOA reflectance
- LULC (pixel-level) generated by TerraMind TiM from simulated S2 BOA reflectance
- Stored under tim_generation_* (subfolders s2l2a/, lulc/, and classes/)
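To make the TiM storage layout concrete, here is a small sketch that builds the paths of the three product subfolders for a given split. The helper name `tim_product_dirs` is hypothetical; only the folder names come from this README, and the file formats inside each subfolder are not assumed.

```python
from pathlib import Path

# Subfolders of tim_generation_* listed in this README.
TIM_SUBFOLDERS = ("s2l2a", "lulc", "classes")


def tim_product_dirs(root, split):
    """Map each TiM product name to its directory for the given split."""
    if split not in ("balanced", "clean"):
        raise ValueError(f"unknown split: {split!r}")
    base = Path(root) / f"tim_generation_{split}"
    return {name: base / name for name in TIM_SUBFOLDERS}
```

For example, `tim_product_dirs("data", "clean")["lulc"]` resolves to `data/tim_generation_clean/lulc`, where `data` stands in for wherever the dataset was unpacked.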
4. Modality relationships
- HSI (AVIRIS-NG) is the primary observation modality.
- RGB is a visualization or derived view aligned to the same patch footprint.
- Simulated Sentinel-2 BOA reflectance (S2 BOA refl) is simulated from HSI and used as input to TiM/TerraMind.
- S2L2A is not directly stored as a standalone raw simulation in the root; it is produced by TerraMind TiM and stored inside tim_generation_*.
- LULC is produced by TerraMind TiM (pixel-level) and stored inside tim_generation_*.
- Captions provide text descriptions for multimodal experiments (retrieval, captioning, instruction-following, VLM/LLM alignment).
- Intuition-1 simulated data (clean only) provides an extra modality for robustness and domain-shift experiments.
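The derivation chain described in this section can be written down as a small lookup, which is handy when tracing which inputs a product depends on. This is an illustrative sketch: `DERIVED_FROM` and `lineage` are hypothetical names, and only the relationships stated explicitly above are encoded. RGB, captions, and Intuition-1 data are left out because their exact inputs are not specified here.

```python
# Each product maps to the modality it is derived from.
# None marks the primary observation (AVIRIS-NG HSI).
DERIVED_FROM = {
    "hsi": None,
    "simulated_s2_boarefl": "hsi",
    "s2l2a": "simulated_s2_boarefl",  # TiM-generated
    "lulc": "simulated_s2_boarefl",   # TiM-generated, pixel-level
}


def lineage(product):
    """Return the chain from `product` back to the primary observation."""
    chain = [product]
    while DERIVED_FROM[chain[-1]] is not None:
        chain.append(DERIVED_FROM[chain[-1]])
    return chain
```

Calling `lineage("lulc")` walks back through the simulated S2 BOA reflectance to HSI, which reflects that any change to the HSI-to-S2 simulation also affects both TiM products.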
5. Warning
Before using the dataset, check the dataset class for any changes to the file naming convention.
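One way to spot naming drift early is to compare file names across two modality folders. The sketch below assumes that files belonging to the same patch share a file stem (the name without its extension) across folders; this README does not state that convention, so treat it as an assumption to adapt. The function name `unmatched_stems` is hypothetical.

```python
from pathlib import Path


def unmatched_stems(dir_a, dir_b):
    """Return file stems found only in dir_a and only in dir_b (both sorted).

    Assumption: matching patches share a file stem across modality folders.
    """
    stems_a = {p.stem for p in Path(dir_a).iterdir() if p.is_file()}
    stems_b = {p.stem for p in Path(dir_b).iterdir() if p.is_file()}
    return sorted(stems_a - stems_b), sorted(stems_b - stems_a)
```

Two empty lists indicate that every file in one folder has a same-stem counterpart in the other; non-empty lists point to the names worth checking against the dataset class.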