
CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Medical imaging data engineering pipeline for standardizing diverse datasets (CT, MRI, PET) into a unified NIfTI format with consistent JSON metadata. Each subdirectory handles one dataset (AbdomenAtlas, BRATS, MnM2, OASIS, OAI_ZIB, PSMA, Kaggle OSIC, etc.).

Running Data Cleaning Scripts

Each dataset has its own dataclean_*.py script. Run from the dataset's subdirectory:

python dataclean_abdomen_atlas.py --target_path /path/to/raw/data --output_dir /path/to/output

All scripts follow the same --target_path / --output_dir argument pattern. Versioned scripts (e.g., _v2.py, _v3.py) represent iterative improvements; use the highest version unless investigating regressions.
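
The shared argument pattern could be parsed roughly as in the sketch below. Only the `--target_path` and `--output_dir` flags come from this file; the function name `build_parser`, the help strings, and marking both flags as required are assumptions, and the actual scripts may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the two-argument CLI shared by the dataclean_*.py scripts (sketch)."""
    parser = argparse.ArgumentParser(
        description="Clean one raw dataset into NIfTI plus JSON metadata."
    )
    # Flag names are taken from this file; required=True is an assumption.
    parser.add_argument("--target_path", required=True, help="Root of the raw dataset")
    parser.add_argument("--output_dir", required=True, help="Directory for processed output")
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["--target_path", "/path/to/raw/data", "--output_dir", "/path/to/output"])
```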

Dependencies

Python 3 with: SimpleITK, pandas, numpy, tqdm, openpyxl (for Excel metadata). No requirements.txt exists; install the packages manually.

Architecture

Processing Pipeline (per dataset)

  1. Load raw data (DICOM via sitk.ImageSeriesReader, NIfTI via sitk.ReadImage, or NRRD)
  2. Extract metadata from headers, CSV files, or DICOM tags
  3. Resample to isotropic spacing using minimum voxel spacing (get_unisize_resampler)
  4. Clamp intensities: CT to [-300, 300] HU; MRI range varies per dataset
  5. Process segmentation labels with identical resampling (nearest-neighbor interpolation)
  6. Validate image/label dimension alignment via assert on GetSize()
  7. Write standardized NIfTI (.nii.gz) + append to nifti_mappings.json
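
The geometry behind steps 3 and 4 can be illustrated without SimpleITK. The sketch below is not the code in util.py: `isotropic_geometry` and `clamp` are hypothetical helpers, and the rounding rule and tolerance are assumptions. It computes the output grid that keeps the physical extent when every axis is resampled to the minimum spacing, and applies the CT clamp range to a single value.

```python
from typing import Optional, Sequence, Tuple


def isotropic_geometry(
    size: Sequence[int], spacing: Sequence[float], tol: float = 1e-6
) -> Optional[Tuple[Tuple[int, ...], Tuple[float, ...]]]:
    """Return (new_size, new_spacing), or None if the spacing is already isotropic."""
    target = min(spacing)  # the minimum spacing becomes the isotropic voxel edge
    if all(abs(s - target) <= tol for s in spacing):
        return None  # mirrors get_unisize_resampler() returning None
    # Keep the physical extent: new_n * target is approximately old_n * old_spacing.
    new_size = tuple(max(1, round(n * s / target)) for n, s in zip(size, spacing))
    return new_size, (target,) * len(spacing)


def clamp(value: float, low: float = -300.0, high: float = 300.0) -> float:
    """Clamp one intensity to [low, high]; defaults are the CT range from step 4."""
    return max(low, min(high, value))


# A 512 x 512 x 100 volume at 0.75 x 0.75 x 3.0 mm becomes 512 x 512 x 400 at 0.75 mm.
print(isotropic_geometry((512, 512, 100), (0.75, 0.75, 3.0)))
```

Because labels go through the same geometry (with nearest-neighbor interpolation), the image and label sizes should match afterwards, which is what the assert in step 6 checks.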

Key Shared Components

util.py (copied into each dataset directory, not a shared import):

  • meta_data class: validates metadata against the config_format.json schema, enforces required fields (Modality, OriImg_path, Spacing_mm, Size, Dataset_name), and normalizes ambiguous terminology via synonym dictionaries
  • get_unisize_resampler(): builds a SimpleITK resampler for isotropic spacing; returns None if spacing is already isotropic
  • clamp_image(): HU/intensity clamping via sitk.ClampImageFilter
  • get_synonyms_dict() / replace_synonyms(): canonical mapping for ROI names, tissue labels, modalities, and task types
  • load_nifti(), load_dicom_images(), save_nifti(): I/O wrappers that embed FolderPath metadata in NIfTI headers
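
Synonym normalization of this kind can be sketched as a case-insensitive lookup from variant to canonical term. The helpers `build_synonym_lookup` and `normalize_term` below are hypothetical names, not the signatures of get_synonyms_dict() or replace_synonyms(); the two example mappings are the ones this file mentions.

```python
from typing import Dict, List


def build_synonym_lookup(groups: Dict[str, List[str]]) -> Dict[str, str]:
    """Flatten {canonical: [variants]} into a lowercase variant -> canonical map."""
    lookup: Dict[str, str] = {}
    for canonical, variants in groups.items():
        lookup[canonical.lower()] = canonical
        for variant in variants:
            lookup[variant.lower()] = canonical
    return lookup


def normalize_term(term: str, lookup: Dict[str, str]) -> str:
    """Return the canonical term, or the input unchanged if it is not known."""
    return lookup.get(term.strip().lower(), term)


lookup = build_synonym_lookup({"thorax": ["chest"], "segmentation": ["seg"]})
print(normalize_term("Chest", lookup))
```

Leaving unknown terms unchanged is a design assumption here; a stricter variant could raise instead so that unmapped vocabulary is caught early.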

config_format.json (per dataset directory): defines the metadata schema β€” field types, required flags, and allowed option values.
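
To make the schema idea concrete, here is a minimal validation sketch. The schema layout in `EXAMPLE_SCHEMA` (keys `required` and `options`) is hypothetical and may not match the real config_format.json; only the five required field names come from this file, and `validate_record` is an illustrative helper, not the meta_data class.

```python
from typing import Any, Dict, List

# Hypothetical layout; the real config_format.json may be organized differently.
EXAMPLE_SCHEMA: Dict[str, Dict[str, Any]] = {
    "Modality": {"required": True, "options": ["CT", "MRI", "PET"]},
    "OriImg_path": {"required": True},
    "Spacing_mm": {"required": True},
    "Size": {"required": True},
    "Dataset_name": {"required": True},
}


def validate_record(record: Dict[str, Any], schema: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record passes."""
    errors: List[str] = []
    for field, rule in schema.items():
        if record.get(field) is None:
            if rule.get("required"):
                errors.append(f"missing required field: {field}")
            continue
        options = rule.get("options")
        if options is not None and record[field] not in options:
            errors.append(f"{field}={record[field]!r} not in allowed options {options}")
    return errors
```

Collecting all problems in a list, rather than raising on the first one, makes it easier to log a complete report per file.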

Output Structure

{output_dir}/{patient_id}/{patient_id}.nii.gz          # processed image
{output_dir}/{patient_id}/{task}/{tissue}.nii.gz        # segmentation labels
{output_dir}/nifti_mappings.json                        # metadata keyed by output path
{output_dir}/failed_files.json                          # files that failed processing
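
The layout above can be expressed as two small helpers, sketched here under assumptions: `output_path` and `append_mapping` are hypothetical names, and the read-modify-write approach to nifti_mappings.json is only one plausible way to "append" to a JSON object keyed by output path.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def output_path(output_dir: str, patient_id: str,
                task: Optional[str] = None, tissue: Optional[str] = None) -> Path:
    """Build the image path, or the label path when task and tissue are both given."""
    base = Path(output_dir) / patient_id
    if task is None and tissue is None:
        return base / f"{patient_id}.nii.gz"
    if task is None or tissue is None:
        raise ValueError("task and tissue must be given together")
    return base / task / f"{tissue}.nii.gz"


def append_mapping(output_dir: str, nifti_path: Path, metadata: Dict[str, Any]) -> None:
    """Add one entry to nifti_mappings.json, keyed by the output path."""
    mapping_file = Path(output_dir) / "nifti_mappings.json"
    mappings: Dict[str, Any] = {}
    if mapping_file.exists():
        mappings = json.loads(mapping_file.read_text(encoding="utf-8"))
    mappings[str(nifti_path)] = metadata
    mapping_file.write_text(
        json.dumps(mappings, indent=2, ensure_ascii=False), encoding="utf-8"
    )
```

For example, `output_path("/path/to/output", "P001", "segmentation", "liver")` yields `/path/to/output/P001/segmentation/liver.nii.gz`; the patient ID, task, and tissue values are made up for illustration.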

Dataset-Specific Notes

  • AbdomenAtlas: 25-organ segmentation labels stored as individual NIfTI files per organ; also has combined_labels.nii.gz (values 0-25)
  • BRATS (2019/2020/2021): Multi-modal MRI (FLAIR, T1, T1ce, T2); each modality is processed as a separate sub-modality entry
  • MnM2/MnMs: Cardiac MRI with vendor metadata (Siemens, Philips, GE, Canon)
  • OASIS: Both cross-sectional and longitudinal variants; includes clinical scores (MMSE, CDR)
  • OAI_ZIB: Knee MRI with 6-structure segmentation and clinical grading (WOMAC)
  • PSMA: Dual-tracer PET/CT (PSMA & FDG); has longitudinal variant

Important Conventions

  • Resampling uses the minimum of the original spacing values to create isotropic voxels
  • Labels are resampled with nearest-neighbor interpolation; images use linear
  • The meta_data class normalizes terminology automatically, e.g., "chest" maps to "thorax" and "seg" maps to "segmentation"
  • util.py is duplicated across directories (not shared via import), so changes must be propagated manually
  • Code comments and docstrings are frequently in Chinese
  • Log files (*.log) in each directory contain processing run history; these can be large (up to 23 MB)
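
Since util.py has to be kept in sync by hand, a small drift check can help spot copies that have diverged. This is a sketch, not an existing tool in the repository: `group_util_copies` is a hypothetical helper, and it assumes every dataset directory is a direct child of the repository root.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def group_util_copies(repo_root: str) -> Dict[str, List[str]]:
    """Group dataset directories by the SHA-256 digest of their util.py copy."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in sorted(Path(repo_root).glob("*/util.py")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path.parent.name)
    return dict(groups)


def report_drift(repo_root: str) -> None:
    """Print each distinct util.py variant and the directories that hold it."""
    groups = group_util_copies(repo_root)
    if len(groups) <= 1:
        print("all util.py copies are identical")
        return
    for digest, dirs in groups.items():
        print(f"{digest[:12]}: {', '.join(dirs)}")
```

More than one group means at least one copy differs byte-for-byte from the others; whether that difference is intentional still needs a manual diff.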