# Methane Benchmark Dataset (PINEAPPLE + Clean)
This folder contains the **Methane Benchmark Dataset** in two variants:
- **balanced**: a balanced mix of methane and non-methane patches
- **clean**: **no-methane only** (negative patches)
The dataset combines multiple modalities (HSI and RGB), **simulated Sentinel-2 BOA reflectance (S2 BOA refl)** derived from HSI, **TerraMind TiM-generated products** (including **S2L2A** and **LULC**), text captions, and labels produced by different sources (LLM, human, and TiM/TerraMind). The clean split additionally contains **Intuition-1 simulated data**.
---
## 1. Dataset overview
### 1.1 balanced (PINEAPPLE: methane + non-methane)
- **178 patches**, **27 flights**
- **HSI**: AVIRIS-NG
- **RGB**: RGB renderings / visualizations aligned with the patches
- **Simulated Sentinel-2 (BOA reflectance)**: derived from HSI and stored under `simulated_s2_boarefl_balanced/`
- **TerraMind TiM products** (derived from simulated S2 BOA reflectance; stored under `tim_generation_balanced/`):
  - **S2L2A** (TiM-generated)
  - **LULC** (TiM-generated, pixel-level)
  - Plots and auxiliary outputs
- **Annotations**
  - Urban vs. non-urban (image-level): **LLM**
  - Urban vs. non-urban (image-level): **human**
  - Textual description: **LLM**
### 1.2 clean (no-methane only)
- **261 patches** (neighboring patches; center patch excluded), **20 flights**
- **HSI**: AVIRIS-NG
- **RGB**: RGB renderings / visualizations aligned with the patches
- **Simulated Sentinel-2 (BOA reflectance)**: derived from HSI and stored under `simulated_s2_boareflclean/` (folder name preserved as exported)
- **TerraMind TiM products** (derived from simulated S2 BOA reflectance; stored under `tim_generation_clean/`):
  - **S2L2A** (TiM-generated)
  - **LULC** (TiM-generated, pixel-level)
  - Plots and auxiliary outputs
- **Intuition-1 simulated data (clean only)**: additional simulated modality for extended ablations and robustness checks (see notes in Section 2)
- **Annotations**
  - Urban vs. non-urban (image-level): **LLM**
  - Urban vs. non-urban (image-level): **human**
  - Textual description: **LLM**
---
## 2. Folder structure
Top-level directories:
- `aviris_hsi_balanced/`
AVIRIS-NG hyperspectral patches for the balanced split.
- `aviris_hsi_clean/`
AVIRIS-NG hyperspectral patches for the clean (no-methane) split.
- `rgb_balanced/`
RGB images for the balanced split (aligned to patches).
- `rgb_clean/`
RGB images for the clean split (aligned to patches).
- `captions_balanced/`
LLM-generated text captions/descriptions for the balanced split.
- `captions_clean/`
LLM-generated text captions/descriptions for the clean split.
- `simulated_s2_boarefl_balanced/`
Simulated Sentinel-2 BOA reflectance images for the balanced split (simulated from HSI).
- `simulated_s2_boareflclean/`
Simulated Sentinel-2 BOA reflectance images for the clean split (simulated from HSI; folder name preserved as exported).
- `tim_generation_balanced/`
TerraMind TiM outputs generated from simulated S2 BOA reflectance (balanced split).
Contains (at least): `s2l2a/`, `lulc/`, `classes/`, `plots/`, and auxiliary files (e.g., a legend script).
- `tim_generation_clean/`
TerraMind TiM outputs generated from simulated S2 BOA reflectance (clean split).
Contains the same product types as the balanced split.
- `I1_simulation/`
Additional Intuition-1 simulated data aligned with clean split patches.

Other files:
- `truth_false_labels.xlsx`
A compact label file (yes/no style) aggregating selected annotations (LLM, human, TiM classes), depending on your export.
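
The layout above can be verified after download with a short script. This is a minimal sketch, not part of the dataset tooling: the helper name `find_missing` is hypothetical, and the directory and file names are copied from the listing above, so adjust them if your export differs.

```python
from pathlib import Path

# Directory names copied from the listing above.
EXPECTED_DIRS = [
    "aviris_hsi_balanced",
    "aviris_hsi_clean",
    "rgb_balanced",
    "rgb_clean",
    "captions_balanced",
    "captions_clean",
    "simulated_s2_boarefl_balanced",
    "simulated_s2_boareflclean",  # no underscore before "clean", as exported
    "tim_generation_balanced",
    "tim_generation_clean",
    "I1_simulation",
]
EXPECTED_FILES = ["truth_false_labels.xlsx"]


def find_missing(root):
    """Return the expected directories and files that are absent under root."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing

# Example (hypothetical path): find_missing("path/to/dataset") returns []
# when every listed directory and file is present.
```
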
---
## 3. Labels and annotation sources
The dataset provides yes/no labels and/or categorical classes from the following sources:
### 3.1 LLM labels (image-level)
- Urban vs. non-urban classification at image/patch level
- Stored in the exported label file and/or per-sample metadata (depending on your pipeline)
### 3.2 Human labels (image-level)
- Urban vs. non-urban classification at image/patch level
- Available for at least the clean split (and optionally balanced, depending on the export)
### 3.3 TerraMind TiM products (pixel-level and per-image products)
- **S2L2A** generated by TerraMind TiM from simulated S2 BOA reflectance
- **LULC** (pixel-level) generated by TerraMind TiM from simulated S2 BOA reflectance
- Stored under `tim_generation_*` (subfolders `s2l2a/`, `lulc/`, and `classes/`)
---
## 4. Modality relationships
- **HSI (AVIRIS-NG)** is the primary observation modality.
- **RGB** is a visualization or derived view aligned to the same patch footprint.
- **Simulated Sentinel-2 BOA reflectance (S2 BOA refl)** is simulated from HSI and used as input to TiM/TerraMind.
- **S2L2A** is not stored as a standalone simulation at the root level; it is produced by **TerraMind TiM** and stored inside `tim_generation_*`.
- **LULC** is produced by **TerraMind TiM** (pixel-level) and stored inside `tim_generation_*`.
- **Captions** provide text descriptions for multimodal experiments (retrieval, captioning, instruction-following, VLM/LLM alignment).
- **Intuition-1 simulated data** (clean only) provides an extra modality for robustness and domain-shift experiments.
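
The derivation chain described above can be written down as a small lookup, which may help when tracking which products depend on which inputs. This is an illustrative sketch only: the short keys (`hsi`, `s2_boa`, `s2l2a`, `lulc`) and the `lineage` helper are hypothetical names, and only the derivations stated explicitly in the list are encoded (RGB, captions, and Intuition-1 are left out because their exact source is not specified here).

```python
# Derivation links stated in the list above: child modality -> parent modality.
DERIVED_FROM = {
    "s2_boa": "hsi",     # simulated Sentinel-2 BOA reflectance comes from HSI
    "s2l2a": "s2_boa",   # TiM-generated S2L2A comes from simulated S2 BOA
    "lulc": "s2_boa",    # TiM-generated LULC comes from simulated S2 BOA
}


def lineage(modality):
    """Return the chain from a modality back to its root source."""
    chain = [modality]
    while chain[-1] in DERIVED_FROM:
        chain.append(DERIVED_FROM[chain[-1]])
    return chain

# Example: lineage("lulc") walks lulc -> s2_boa -> hsi.
```
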
---
## 5. Warning
Before using the dataset, check your dataset class in case the file naming convention has changed.
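
One way to spot a naming change is to compare file names between two modality folders. The sketch below assumes that files belonging to the same patch share a name stem (the name without its extension) across folders; this README does not state that convention, so treat it as an assumption to confirm against your copy. The helper names `collect_stems` and `unmatched_stems` are hypothetical.

```python
from pathlib import Path


def collect_stems(folder):
    """Return the set of file-name stems found directly inside folder."""
    return {p.stem for p in Path(folder).iterdir() if p.is_file()}


def unmatched_stems(root, reference_dir, other_dir):
    """Return the sorted stems that appear in only one of the two folders."""
    root = Path(root)
    reference = collect_stems(root / reference_dir)
    other = collect_stems(root / other_dir)
    # Symmetric difference: stems missing from either side.
    return sorted(reference ^ other)
```

For example, `unmatched_stems(root, "aviris_hsi_clean", "rgb_clean")` returns an empty list if every HSI patch has an RGB counterpart with the same stem. A non-empty result suggests either missing files or a naming convention that differs from the assumption above.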