---
license: mit
tags:
- medical-imaging
- data-engineering
- preprocessing
- nifti
- dicom
- simpleitk
library_name: simpleitk
---

# Data_Engineering: Medical Imaging Cleanup Pipeline

Standardize diverse medical imaging datasets (CT, MRI, PET) into a unified **NIfTI** format with consistent JSON metadata. Each subdirectory targets one dataset.

> Companion repo to [`DRDMsig/Omini3D`](https://huggingface.co/DRDMsig/Omini3D); it produces the standardized data that OmniMorph trains on.

## Supported Datasets

| Subdirectory | Dataset | Modality |
|---|---|---|
| `AbdomenAtlas/` | AbdomenAtlas | CT |
| `AbdomenCT1k/` | AbdomenCT-1K | CT |
| `brats2019_clean/` | BraTS 2019 | MRI (multi-sequence) |
| `brats2020_clean/` | BraTS 2020 | MRI (multi-sequence) |
| `brats2021_clean/` | BraTS 2021 | MRI (multi-sequence) |
| `kaggle_osic_clean/` | Kaggle OSIC Pulmonary Fibrosis | CT |
| `MnM2_clean/` | M&Ms-2 | Cardiac MRI |
| `MnMs_clean/` | M&Ms | Cardiac MRI |
| `OAISIS_clean/` | OASIS-1 / OASIS-2 | Brain MRI |
| `OAI_ZIB_clean/` | OAI-ZIB (knee) | MRI |
| `PSMA_clean/` | PSMA-FDG PET-CT (longitudinal) | PET + CT |
| `all/` | Cross-dataset utilities (artifact plane removal) | N/A |

Each cleaned dataset writes:

- Resampled & clamped `.nii.gz` images / segmentations
- Per-dataset `nifti_mappings.json`
- `failed_files.json` listing files the cleaner could not process

## Repository Layout

```
_clean/
├── dataclean_.py          # main cleanup script (use highest version: _v2.py, _v3.py, ...)
├── util.py                # shared helpers (copied per dir, not imported)
├── config_format.json     # metadata schema for `meta_data` validation
└── (optional) sample/, demo/   # tiny example NIfTI files for sanity checks
```

## Usage

```bash
cd AbdomenAtlas/
python dataclean_abdomen_atlas_v2.py \
    --target_path /path/to/raw/AbdomenAtlas \
    --output_dir /path/to/output/AbdomenAtlas_clean
```

All scripts share the `--target_path` / `--output_dir` interface.
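
The shared command-line interface could be parsed roughly as follows. This is a minimal sketch, not the repository's actual code: only the two flag names come from this README, while `build_parser`, `parse_cli`, the help strings, and the choice to make both flags required are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the two flags shared by the cleanup scripts.

    Hypothetical helper: the real scripts may define their arguments differently.
    """
    parser = argparse.ArgumentParser(
        description="Hypothetical dataset cleanup entry point"
    )
    # Flag names are taken from the README; required=True is an assumption.
    parser.add_argument(
        "--target_path", required=True, help="Directory holding the raw dataset"
    )
    parser.add_argument(
        "--output_dir", required=True, help="Directory for the cleaned outputs"
    )
    return parser


def parse_cli(argv=None) -> argparse.Namespace:
    """Parse argv (defaults to sys.argv[1:] when argv is None)."""
    return build_parser().parse_args(argv)
```

Calling `parse_cli(["--target_path", "/raw", "--output_dir", "/out"])` returns a namespace whose `target_path` and `output_dir` attributes hold the two paths. Keeping the flag names identical across scripts means a single wrapper can drive every dataset directory.
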
Versioned scripts (`_v2.py`, `_v3.py`) supersede older versions; use the highest version unless you are investigating a regression.

### Pipeline (per dataset)

1. **Load** raw data (DICOM via `sitk.ImageSeriesReader`, NIfTI via `sitk.ReadImage`, NRRD).
2. **Extract metadata** from headers, CSV files, or DICOM tags.
3. **Resample** to isotropic spacing (`get_unisize_resampler` in `util.py`).
4. **Clamp intensities.** CT: `[-300, 300]` HU; MRI: per-dataset windows.
5. **Process segmentation labels** with the same resampling, using nearest-neighbor interpolation.
6. **Validate** that image and label dimensions agree (`assert image.GetSize() == label.GetSize()`).
7. **Write** the standardized `.nii.gz` and append an entry to `nifti_mappings.json`.

### Shared `util.py` API

| Function / class | Purpose |
|---|---|
| `meta_data` | Validates metadata against `config_format.json`; required fields: `Modality`, `OriImg_path`, `Spacing_mm`, `Size`, `Dataset_name`. Normalizes ambiguous terminology via synonym dictionaries. |
| `get_unisize_resampler(image)` | Builds a SimpleITK resampler for isotropic spacing; returns `None` if the image is already isotropic. |
| `clamp_image(image, lo, hi)` | Clamps HU/intensity values via `sitk.ClampImageFilter`. |

## Dependencies

```bash
pip install SimpleITK pandas numpy tqdm openpyxl
```

(There is no `requirements.txt`; install the packages manually.)

## What's Included / Excluded

- ✅ Cleanup scripts, `util.py`, `config_format.json`, demographic CSVs.
- ✅ A handful of tiny demo / sample `.nii.gz` files in `PSMA_clean/{sample,demo}/`.
- ❌ Raw datasets (download from each dataset's official source).
- ❌ Run logs from prior cleanup runs (`*.log`).
- ❌ Intermediate test outputs (`MnM2_clean/test/`).

## License

MIT; see the project root.
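
## Illustrative Sketch: Required Metadata Fields

To make the required-field rule from the `util.py` API table concrete, the check can be sketched in plain Python. This is not the `meta_data` implementation: the field names come from this README, but `REQUIRED_FIELDS`, `missing_required_fields`, and the treatment of `None` or empty strings as missing are assumptions for illustration, and synonym normalization is not shown.

```python
# Field names listed as required in this README.
REQUIRED_FIELDS = ("Modality", "OriImg_path", "Spacing_mm", "Size", "Dataset_name")


def missing_required_fields(record: dict) -> list:
    """Return the required field names that are absent or empty in `record`.

    Hypothetical helper: treating None and "" as missing is an assumption.
    """
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or value == "":
            missing.append(field)
    return missing


# Example record with made-up values, for illustration only.
example = {
    "Modality": "CT",
    "OriImg_path": "/path/to/raw/AbdomenAtlas/case.nii.gz",
    "Spacing_mm": [1.0, 1.0, 1.0],
    "Size": [64, 64, 64],
    "Dataset_name": "AbdomenAtlas",
}
print(missing_required_fields(example))  # an empty list means nothing is missing
```

A record that fails such a check is a natural candidate for `failed_files.json` rather than `nifti_mappings.json`, though how the actual scripts route failures is not specified here.
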