
Methane Benchmark Dataset (PINEAPPLE + Clean)

This folder contains the Methane Benchmark Dataset in two variants:

  • balanced: a balanced mix of methane and non-methane patches
  • clean: no-methane only (negative patches)

The dataset combines multiple modalities (HSI and RGB), simulated Sentinel-2 BOA reflectance (S2 BOA refl) derived from HSI, TerraMind TiM-generated products (including S2L2A and LULC), text captions, and labels produced by different sources (LLM, human, and TiM/TerraMind). The clean split additionally contains Intuition-1 simulated data.


1. Dataset overview

1.1 balanced (PINEAPPLE: methane + non-methane)

  • 178 patches, 27 flights
  • HSI: AVIRIS-NG
  • RGB: RGB renderings / visualizations aligned with the patches
  • Simulated Sentinel-2 (BOA reflectance): derived from HSI and stored under simulated_s2_boarefl_balanced/
  • TerraMind TiM products (derived from simulated S2 BOA reflectance; stored under tim_generation_balanced/):
    • S2L2A (TiM-generated)
    • LULC (TiM-generated, pixel-level)
    • Plots and auxiliary outputs
  • Annotations
    • Urban vs. non-urban (image-level): LLM
    • Urban vs. non-urban (image-level): human
    • Textual description: LLM

1.2 clean (no-methane only)

  • 261 patches (neighboring patches; center patch excluded), 20 flights
  • HSI: AVIRIS-NG
  • RGB: RGB renderings / visualizations aligned with the patches
  • Simulated Sentinel-2 (BOA reflectance): derived from HSI and stored under simulated_s2_boareflclean/ (folder name preserved as exported)
  • TerraMind TiM products (derived from simulated S2 BOA reflectance; stored under tim_generation_clean/):
    • S2L2A (TiM-generated)
    • LULC (TiM-generated, pixel-level)
    • Plots and auxiliary outputs
  • Intuition-1 simulated data (clean only): additional simulated modality for extended ablations and robustness checks (see notes in Section 2)
  • Annotations
    • Urban vs. non-urban (image-level): LLM
    • Urban vs. non-urban (image-level): human
    • Textual description: LLM

2. Folder structure

Top-level directories:

  • aviris_hsi_balanced/
    AVIRIS-NG hyperspectral patches for the balanced split.

  • aviris_hsi_clean/
    AVIRIS-NG hyperspectral patches for the clean (no-methane) split.

  • rgb_balanced/
    RGB images for the balanced split (aligned to patches).

  • rgb_clean/
    RGB images for the clean split (aligned to patches).

  • captions_balanced/
    LLM-generated text captions/descriptions for the balanced split.

  • captions_clean/
    LLM-generated text captions/descriptions for the clean split.

  • simulated_s2_boarefl_balanced/
    Simulated Sentinel-2 BOA reflectance images for the balanced split (simulated from HSI).

  • simulated_s2_boareflclean/
    Simulated Sentinel-2 BOA reflectance images for the clean split (simulated from HSI; folder name preserved as exported).

  • tim_generation_balanced/
    TerraMind TiM outputs generated from simulated S2 BOA reflectance (balanced split).
    Contains (at least): s2l2a/, lulc/, classes/, plots/, and auxiliary files (e.g., a legend script).

  • tim_generation_clean/
    TerraMind TiM outputs generated from simulated S2 BOA reflectance (clean split).
    Contains the same product types as the balanced split.

  • I1_simulation/
    Additional Intuition-1 simulated data aligned with the clean split patches.

Other files:

  • truth_false_labels.xlsx
    A compact yes/no label file aggregating selected annotations (LLM, human, TiM classes); the exact contents depend on the export.
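
As a quick sanity check of a local copy, the layout above can be sketched in Python. The directory names are taken from the list above; the root path and the helper names expected_dirs and missing_dirs are illustrative assumptions and not part of any dataset tooling.

```python
from pathlib import Path

# Directory names as listed in this README. Note that the clean
# simulated S2 folder has no underscore before "clean".
SPLIT_DIRS = {
    "balanced": [
        "aviris_hsi_balanced",
        "rgb_balanced",
        "captions_balanced",
        "simulated_s2_boarefl_balanced",
        "tim_generation_balanced",
    ],
    "clean": [
        "aviris_hsi_clean",
        "rgb_clean",
        "captions_clean",
        "simulated_s2_boareflclean",
        "tim_generation_clean",
        "I1_simulation",
    ],
}


def expected_dirs(root, split):
    """Return the expected top-level directories for a split (hypothetical helper)."""
    return [Path(root) / name for name in SPLIT_DIRS[split]]


def missing_dirs(root, split):
    """Return the names of expected directories that do not exist under root."""
    return [p.name for p in expected_dirs(root, split) if not p.is_dir()]
```

Calling missing_dirs("path/to/dataset", "clean") returns an empty list when every expected folder is present, and the names of the absent folders otherwise.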

3. Labels and annotation sources

The dataset provides yes/no labels and/or categorical classes from the following sources:

3.1 LLM labels (image-level)

  • Urban vs. non-urban classification at image/patch level
  • Stored in the exported label file and/or per-sample metadata (depending on your pipeline)

3.2 Human labels (image-level)

  • Urban vs. non-urban classification at image/patch level
  • Available for at least the clean split (and optionally balanced, depending on the export)

3.3 TerraMind TiM products (pixel-level and per-image products)

  • S2L2A generated by TerraMind TiM from simulated S2 BOA reflectance
  • LULC (pixel-level) generated by TerraMind TiM from simulated S2 BOA reflectance
  • Stored under tim_generation_* (subfolders s2l2a/, lulc/, and classes/)
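
To see what a given export actually contains, the TiM subfolders can be inspected with a short script. This is a minimal sketch: the subfolder names come from this README, while the root path and the tim_file_counts helper are hypothetical, and no assumption is made about file formats or names inside the subfolders.

```python
from pathlib import Path

# Subfolders named in this README for tim_generation_*.
TIM_SUBFOLDERS = ("s2l2a", "lulc", "classes")


def tim_file_counts(root, split):
    """Count files in each TiM product subfolder for a split."""
    base = Path(root) / f"tim_generation_{split}"
    counts = {}
    for name in TIM_SUBFOLDERS:
        folder = base / name
        if folder.is_dir():
            counts[name] = sum(1 for p in folder.iterdir() if p.is_file())
        else:
            # A missing subfolder is reported as None rather than raising.
            counts[name] = None
    return counts
```

Comparing the counts returned for each subfolder against the patch counts in Section 1 is a cheap way to notice an incomplete download.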

4. Modality relationships

  • HSI (AVIRIS-NG) is the primary observation modality.
  • RGB is a visualization or derived view aligned to the same patch footprint.
  • Simulated Sentinel-2 BOA reflectance (S2 BOA refl) is simulated from HSI and used as input to TiM/TerraMind.
  • S2L2A is not directly stored as a standalone raw simulation in the root; it is produced by TerraMind TiM and stored inside tim_generation_*.
  • LULC is produced by TerraMind TiM (pixel-level) and stored inside tim_generation_*.
  • Captions provide text descriptions for multimodal experiments (retrieval, captioning, instruction-following, VLM/LLM alignment).
  • Intuition-1 simulated data (clean only) provides an extra modality for robustness and domain-shift experiments.
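
The derivation chain described above can be written down as a small lookup table, which helps when tracing which stored product depends on which input. This is only a sketch: the short modality keys (hsi, s2_boa, tim_s2l2a, tim_lulc) and the provenance helper are hypothetical names chosen for illustration. RGB, captions, and Intuition-1 data are left out because this README does not state exactly what they are generated from.

```python
# Each derived product maps to the product it is generated from,
# following the relationships listed in this README.
DERIVED_FROM = {
    "s2_boa": "hsi",        # simulated Sentinel-2 BOA reflectance, simulated from HSI
    "tim_s2l2a": "s2_boa",  # TerraMind TiM output
    "tim_lulc": "s2_boa",   # TerraMind TiM output (pixel-level)
}


def provenance(modality):
    """Return the chain from a modality back to its primary observation."""
    chain = [modality]
    while chain[-1] in DERIVED_FROM:
        chain.append(DERIVED_FROM[chain[-1]])
    return chain
```

For example, provenance("tim_lulc") yields ["tim_lulc", "s2_boa", "hsi"], which makes explicit that the LULC maps inherit any artifacts of the Sentinel-2 simulation step.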

5. Warning

Before using the dataset, check the dataset class for any changes to the file naming convention.
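
One way to spot naming changes is to compare file stems across modality folders. The sketch below assumes that files belonging to the same patch share a stem (file name without extension) in every folder. This README does not confirm that convention, so treat it as a hypothesis to test; the helper names stems and unmatched_stems are hypothetical.

```python
from pathlib import Path


def stems(folder):
    """Collect file stems (names without extension) directly inside a folder."""
    return {p.stem for p in Path(folder).iterdir() if p.is_file()}


def unmatched_stems(folder_a, folder_b):
    """Return stems found in only one of the two folders.

    Assumption (not confirmed by the README): files for the same patch
    share a stem across modality folders.
    """
    a, b = stems(folder_a), stems(folder_b)
    return {"only_in_a": sorted(a - b), "only_in_b": sorted(b - a)}
```

If both lists come back empty for, say, rgb_clean/ and captions_clean/, the two folders agree under that assumption. Many unmatched stems suggest either a different naming convention or a changed export.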