CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
Medical imaging data engineering pipeline for standardizing diverse datasets (CT, MRI, PET) into a unified NIfTI format with consistent JSON metadata. Each subdirectory handles one dataset (AbdomenAtlas, BRATS, MnM2, OASIS, OAI_ZIB, PSMA, Kaggle OSIC, etc.).
Running Data Cleaning Scripts
Each dataset has its own dataclean_*.py script. Run from the dataset's subdirectory:
python dataclean_abdomen_atlas.py --target_path /path/to/raw/data --output_dir /path/to/output
All scripts follow the same --target_path / --output_dir argument pattern. Versioned scripts (e.g., _v2.py, _v3.py) represent iterative improvements; use the highest version unless investigating regressions.
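The shared argument pattern can be sketched with `argparse` as below. This is a minimal illustration, not the code in any particular script: `build_parser` is a hypothetical name, and marking both arguments as required is an assumption.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the --target_path / --output_dir pattern."""
    parser = argparse.ArgumentParser(
        description="Standardize one raw dataset into NIfTI plus JSON metadata."
    )
    # Root directory of the raw dataset (DICOM, NIfTI, or NRRD files).
    parser.add_argument("--target_path", type=Path, required=True)
    # Directory that receives the processed images and nifti_mappings.json.
    parser.add_argument("--output_dir", type=Path, required=True)
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
example_args = build_parser().parse_args(
    ["--target_path", "/path/to/raw/data", "--output_dir", "/path/to/output"]
)
```

Converting both values to `Path` up front keeps later path joins uniform; the real scripts may keep them as plain strings.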
Dependencies
Python 3 with: SimpleITK, pandas, numpy, tqdm, openpyxl (for Excel metadata). No requirements.txt exists; install manually.
Architecture
Processing Pipeline (per dataset)
- Load raw data (DICOM via sitk.ImageSeriesReader, NIfTI via sitk.ReadImage, or NRRD)
- Extract metadata from headers, CSV files, or DICOM tags
- Resample to isotropic spacing using minimum voxel spacing (get_unisize_resampler)
- Clamp intensities: CT to [-300, 300] HU; MRI range varies per dataset
- Process segmentation labels with identical resampling (nearest-neighbor interpolation)
- Validate image/label dimension alignment via assert on GetSize()
- Write standardized NIfTI (.nii.gz) + append to nifti_mappings.json
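The resampling and clamping steps come down to simple arithmetic, sketched here in plain Python without SimpleITK. `isotropic_grid` and `clamp` are hypothetical names. Rounding the new voxel count to the nearest integer and the tolerance value are assumptions, and may differ from what get_unisize_resampler and clamp_image actually do.

```python
from typing import Optional, Sequence, Tuple

CT_HU_RANGE = (-300, 300)  # CT clamp window stated in the pipeline description


def isotropic_grid(
    size: Sequence[int], spacing: Sequence[float], tol: float = 1e-6
) -> Optional[Tuple[Tuple[int, ...], Tuple[float, ...]]]:
    """Return (new_size, new_spacing) for isotropic resampling, or None if already isotropic."""
    target = min(spacing)  # finest axis spacing becomes the isotropic spacing
    if all(abs(s - target) <= tol for s in spacing):
        return None  # nothing to resample
    # Keep the physical extent: voxels * spacing stays (approximately) constant per axis.
    new_size = tuple(int(round(n * s / target)) for n, s in zip(size, spacing))
    return new_size, (target,) * len(spacing)


def clamp(value: float, lower: float, upper: float) -> float:
    """Limit a single intensity to the closed range [lower, upper]."""
    return max(lower, min(upper, value))
```

For a 512 x 512 x 60 volume with spacing (0.5, 0.5, 2.0) mm, the slice axis grows by a factor of 4 while the in-plane axes are unchanged. Labels would use the same target grid, with nearest-neighbor interpolation so that label values are never blended.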
Key Shared Components
util.py (copied into each dataset directory, not a shared import):
- meta_dataclass: validates metadata against the config_format.json schema, enforces required fields (Modality, OriImg_path, Spacing_mm, Size, Dataset_name), normalizes ambiguous terminology via synonym dictionaries
- get_unisize_resampler(): builds a SimpleITK resampler for isotropic spacing; returns None if spacing is already isotropic
- clamp_image(): HU/intensity clamping via sitk.ClampImageFilter
- get_synonyms_dict() / replace_synonyms(): canonical mapping for ROI names, tissue labels, modalities, and task types
- load_nifti(), load_dicom_images(), save_nifti(): I/O wrappers that embed FolderPath metadata in NIfTI headers
config_format.json (per dataset directory): defines the metadata schema, including field types, required flags, and allowed option values.
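The two jobs of the metadata class, required-field checks and synonym normalization, can be illustrated with a small sketch. It is not the code in util.py: `SYNONYMS`, `normalize_term`, and `missing_required` are hypothetical names, the synonym table holds only the two pairs mentioned in this file, and case-insensitive matching is an assumption.

```python
from typing import Dict, List

# Required fields as listed for meta_dataclass.
REQUIRED_FIELDS = ("Modality", "OriImg_path", "Spacing_mm", "Size", "Dataset_name")

# Hypothetical, heavily abbreviated synonym table (the real dictionaries are larger).
SYNONYMS: Dict[str, str] = {"chest": "thorax", "seg": "segmentation"}


def normalize_term(value: str, synonyms: Dict[str, str] = SYNONYMS) -> str:
    """Map a known synonym to its canonical term; leave unknown terms untouched."""
    return synonyms.get(value.strip().lower(), value)


def missing_required(meta: dict) -> List[str]:
    """List the required fields that are absent or set to None."""
    return [name for name in REQUIRED_FIELDS if meta.get(name) is None]


example_meta = {"Modality": "CT", "Spacing_mm": [1.0, 1.0, 1.0], "Size": [64, 64, 64]}
example_missing = missing_required(example_meta)
```

Returning the list of missing fields, rather than raising on the first one, makes it easier to log every problem with a record in one pass; the actual class may raise instead.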
Output Structure
{output_dir}/{patient_id}/{patient_id}.nii.gz # processed image
{output_dir}/{patient_id}/{task}/{tissue}.nii.gz # segmentation labels
{output_dir}/nifti_mappings.json # metadata keyed by output path
{output_dir}/failed_files.json # files that failed processing
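The layout above can be generated with a few path helpers. The sketch below is illustrative only: `image_path`, `label_path`, and `append_mapping` are hypothetical names, and keying the mapping by the full output path string is an assumption about how nifti_mappings.json is written.

```python
import json
import tempfile
from pathlib import Path


def image_path(output_dir, patient_id: str) -> Path:
    """{output_dir}/{patient_id}/{patient_id}.nii.gz"""
    return Path(output_dir) / patient_id / f"{patient_id}.nii.gz"


def label_path(output_dir, patient_id: str, task: str, tissue: str) -> Path:
    """{output_dir}/{patient_id}/{task}/{tissue}.nii.gz"""
    return Path(output_dir) / patient_id / task / f"{tissue}.nii.gz"


def append_mapping(output_dir, out_path: Path, meta: dict) -> None:
    """Add one metadata record to nifti_mappings.json, creating the file if needed."""
    mapping_file = Path(output_dir) / "nifti_mappings.json"
    if mapping_file.exists():
        mappings = json.loads(mapping_file.read_text(encoding="utf-8"))
    else:
        mappings = {}
    mappings[str(out_path)] = meta
    mapping_file.write_text(
        json.dumps(mappings, indent=2, ensure_ascii=False), encoding="utf-8"
    )


# Demo in a throwaway directory: register one image and one label file.
with tempfile.TemporaryDirectory() as tmp:
    append_mapping(tmp, image_path(tmp, "case_001"), {"Modality": "CT"})
    append_mapping(tmp, label_path(tmp, "case_001", "segmentation", "liver"), {"Modality": "CT"})
    demo_mappings = json.loads((Path(tmp) / "nifti_mappings.json").read_text(encoding="utf-8"))
    demo_keys = [Path(key).relative_to(tmp).as_posix() for key in demo_mappings]
```

Rewriting the whole JSON file on every append is simple but slow for large datasets; collecting records in memory and writing once at the end is a common alternative.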
Dataset-Specific Notes
- AbdomenAtlas: 25-organ segmentation labels stored as individual NIfTI files per organ; also has combined_labels.nii.gz (values 0-25)
- BRATS (2019/2020/2021): Multi-modal MRI (FLAIR, T1, T1ce, T2); each modality is processed as a separate sub-modality entry
- MnM2/MnMs: Cardiac MRI with vendor metadata (Siemens, Philips, GE, Canon)
- OASIS: Both cross-sectional and longitudinal variants; includes clinical scores (MMSE, CDR)
- OAI_ZIB: Knee MRI with 6-structure segmentation and clinical grading (WOMAC)
- PSMA: Dual-tracer PET/CT (PSMA & FDG); has longitudinal variant
Important Conventions
- Resampling uses the minimum of the original spacing values to create isotropic voxels
- Labels are resampled with nearest-neighbor interpolation; images use linear
- The meta_dataclass normalizes terminology automatically, e.g., "chest" maps to "thorax", "seg" maps to "segmentation"
- util.py is duplicated across directories (not shared via import); changes must be propagated manually
- Code comments and docstrings are frequently in Chinese
- Log files (*.log) in each directory contain processing run history; these can be large (up to 23 MB)
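Because util.py is copied rather than imported, the copies can drift apart after a manual edit. One way to spot drift is to hash every copy and group directories by digest, as in this sketch. `group_util_copies` is a hypothetical helper, not part of the repository, and it assumes each util.py sits directly inside a dataset directory one level below the repository root.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def group_util_copies(repo_root) -> Dict[str, List[str]]:
    """Group dataset directory names by the SHA-256 digest of their util.py."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in sorted(Path(repo_root).glob("*/util.py")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path.parent.name)
    return dict(groups)


# Demo on a throwaway tree: two identical copies and one that has drifted.
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("BRATS", "A = 1\n"), ("OASIS", "A = 1\n"), ("PSMA", "A = 2\n")]:
        dataset_dir = Path(tmp) / name
        dataset_dir.mkdir()
        (dataset_dir / "util.py").write_text(body, encoding="utf-8")
    demo_groups = group_util_copies(tmp)
    demo_sizes = sorted(len(names) for names in demo_groups.values())
```

More than one group means at least one copy differs from the rest and a change still needs to be propagated.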