# Methane Benchmark Dataset (PINEAPPLE + Clean)

This folder contains the **Methane Benchmark Dataset** in two variants:

- **balanced**: a balanced mix of methane and non-methane patches
- **clean**: **no-methane only** (negative patches)

The dataset combines multiple modalities (HSI and RGB), **simulated Sentinel-2 BOA reflectance (S2 BOA refl)** derived from HSI, **TerraMind TiM-generated products** (including **S2L2A** and **LULC**), text captions, and labels produced by different sources (LLM, human, and TiM/TerraMind). The clean split additionally contains **Intuition-1 simulated data**.

---

## 1. Dataset overview

### 1.1 balanced (PINEAPPLE: methane + non-methane)

- **178 patches**, **27 flights**
- **HSI**: AVIRIS-NG
- **RGB**: RGB renderings / visualizations aligned with the patches
- **Simulated Sentinel-2 (BOA reflectance)**: derived from HSI and stored under `simulated_s2_boarefl_balanced/`
- **TerraMind TiM products** (derived from simulated S2 BOA reflectance; stored under `tim_generation_balanced/`):
  - **S2L2A** (TiM-generated)
  - **LULC** (TiM-generated, pixel-level)
  - Plots and auxiliary outputs
- **Annotations**
  - Urban vs. non-urban (image-level): **LLM**
  - Urban vs. non-urban (image-level): **human**
  - Textual description: **LLM**

### 1.2 clean (no-methane only)

- **261 patches** (neighboring patches; center patch excluded), **20 flights**
- **HSI**: AVIRIS-NG
- **RGB**: RGB renderings / visualizations aligned with the patches
- **Simulated Sentinel-2 (BOA reflectance)**: derived from HSI and stored under `simulated_s2_boareflclean/` (folder name preserved as exported)
- **TerraMind TiM products** (derived from simulated S2 BOA reflectance; stored under `tim_generation_clean/`):
  - **S2L2A** (TiM-generated)
  - **LULC** (TiM-generated, pixel-level)
  - Plots and auxiliary outputs
- **Intuition-1 simulated data (clean only)**: additional simulated modality for extended ablations and robustness checks (see notes in Section 2)
- **Annotations**
  - Urban vs. non-urban (image-level): **LLM**
  - Urban vs. non-urban (image-level): **human**
  - Textual description: **LLM**

---

## 2. Folder structure

Top-level directories:

- `aviris_hsi_balanced/`
  AVIRIS-NG hyperspectral patches for the balanced split.
- `aviris_hsi_clean/`
  AVIRIS-NG hyperspectral patches for the clean (no-methane) split.
- `rgb_balanced/`
  RGB images for the balanced split (aligned to patches).
- `rgb_clean/`
  RGB images for the clean split (aligned to patches).
- `captions_balanced/`
  LLM-generated text captions/descriptions for the balanced split.
- `captions_clean/`
  LLM-generated text captions/descriptions for the clean split.
- `simulated_s2_boarefl_balanced/`
  Simulated Sentinel-2 BOA reflectance images for the balanced split (simulated from HSI).
- `simulated_s2_boareflclean/`
  Simulated Sentinel-2 BOA reflectance images for the clean split (simulated from HSI; folder name preserved as exported).
- `tim_generation_balanced/`
  TerraMind TiM outputs generated from simulated S2 BOA reflectance (balanced split).
  Contains (at least): `s2l2a/`, `lulc/`, `classes/`, `plots/`, and auxiliary files (e.g., a legend script).
- `tim_generation_clean/`
  TerraMind TiM outputs generated from simulated S2 BOA reflectance (clean split).
  Contains the same product types as the balanced split.
- `I1_simulation`
  Additional Intuition-1 simulated data aligned with clean split patches.

Other files:

- `truth_false_labels.xlsx`
  A compact label file (yes/no style) aggregating selected annotations (LLM, human, TiM classes), depending on your export.

---

## 3. Labels and annotation sources

The dataset provides yes/no labels and/or categorical classes from the following sources:

### 3.1 LLM labels (image-level)

- Urban vs. non-urban classification at image/patch level
- Stored in the exported label file and/or per-sample metadata (depending on your pipeline)

### 3.2 Human labels (image-level)

- Urban vs. non-urban classification at image/patch level
- Available for at least the clean split (and optionally balanced, depending on the export)

### 3.3 TerraMind TiM products (pixel-level and per-image products)

- **S2L2A** generated by TerraMind TiM from simulated S2 BOA reflectance
- **LULC** (pixel-level) generated by TerraMind TiM from simulated S2 BOA reflectance
- Stored under `tim_generation_*` (subfolders `s2l2a/`, `lulc/`, and `classes/`)

---

## 4. Modality relationships

- **HSI (AVIRIS-NG)** is the primary observation modality.
- **RGB** is a visualization or derived view aligned to the same patch footprint.
- **Simulated Sentinel-2 BOA reflectance (S2 BOA refl)** is simulated from HSI and used as input to TiM/TerraMind.
- **S2L2A** is not stored in the root as a standalone raw simulation; it is produced by **TerraMind TiM** and stored inside `tim_generation_*`.
- **LULC** is produced by **TerraMind TiM** (pixel-level) and stored inside `tim_generation_*`.
- **Captions** provide text descriptions for multimodal experiments (retrieval, captioning, instruction-following, VLM/LLM alignment).
- **Intuition-1 simulated data** (clean only) provides an extra modality for robustness and domain-shift experiments.

---

## 5. Warning

Before use, check the dataset class against the files in case the file naming convention has changed.
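
As a quick sanity check before loading anything, the folder layout from Section 2 can be compared against what is actually on disk. The following is a minimal sketch, not part of the dataset tooling: the function name `find_missing` and the constants are hypothetical, and it assumes that `tim_generation_clean/` contains the same `s2l2a/`, `lulc/`, `classes/`, and `plots/` subfolders as the balanced split. It only checks that directories and the label file exist; it does not validate the naming of individual patch files.

```python
from pathlib import Path

# Top-level directories listed in Section 2 (names as exported).
EXPECTED_DIRS = [
    "aviris_hsi_balanced",
    "aviris_hsi_clean",
    "rgb_balanced",
    "rgb_clean",
    "captions_balanced",
    "captions_clean",
    "simulated_s2_boarefl_balanced",
    "simulated_s2_boareflclean",
    "tim_generation_balanced",
    "tim_generation_clean",
    "I1_simulation",
]

# Subfolders listed for tim_generation_balanced/; assumed to be the same
# for tim_generation_clean/.
TIM_SUBDIRS = ["s2l2a", "lulc", "classes", "plots"]

EXPECTED_FILES = ["truth_false_labels.xlsx"]


def find_missing(root):
    """Return the expected paths (relative to root) that do not exist."""
    root = Path(root)
    missing = []
    for name in EXPECTED_DIRS:
        if not (root / name).is_dir():
            missing.append(name)
    for split in ("balanced", "clean"):
        for sub in TIM_SUBDIRS:
            rel = f"tim_generation_{split}/{sub}"
            if not (root / rel).is_dir():
                missing.append(rel)
    for name in EXPECTED_FILES:
        if not (root / name).is_file():
            missing.append(name)
    return missing


if __name__ == "__main__":
    # "." is a placeholder; point it at the dataset root.
    for rel in find_missing("."):
        print(f"missing: {rel}")
```

An empty result means every directory and file named in Section 2 is present; any reported path is a hint that the export or its naming differs from this README.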