Upload README.md
README.md CHANGED
@@ -2,12 +2,24 @@

 ## TL;DR

-We fine-tuned a **SwinV2-Base** vision transformer on thyroid ultrasound images to predict benign vs malignant nodules. The model achieves **

 - **Model**: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid)
 - **Dataset**: [BTX24/thyroid-cancer-classification-ultrasound-dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset)
 - **Task**: Binary classification (benign vs malignant)
-- **Architecture**: SwinV2-Base (

 ---

@@ -32,15 +44,15 @@ We used the [BTX24 thyroid ultrasound dataset](https://huggingface.co/datasets/B

 | Split | Images | Benign (0) | Malignant (1) |
 |-------|--------|-----------|---------------|
-| Train |
-
-| Test | 623 | 358 | 265 |

-- **Modality**: Grayscale ultrasound
 - **Image sizes**: Variable (~270×270 to ~510×370)
 - **Class balance**: ~62% benign, ~38% malignant

-We

 ---

@@ -53,8 +65,6 @@ We chose **SwinV2-Base** (`microsoft/swinv2-base-patch4-window8-256`) for severa
 3. **Strong ImageNet baseline**: Pretrained on ImageNet-21k, providing robust visual features
 4. **Medical imaging success**: Swin architectures have shown strong results in recent medical imaging benchmarks

-The pretrained classifier head (1000 classes) was replaced with a 2-class head for benign/malignant classification. All backbone weights were fine-tuned end-to-end.
-
 ### Training Configuration

 | Hyperparameter | Value |
@@ -63,91 +73,97 @@ The pretrained classifier head (1000 classes) was replaced with a 2-class head f
 | Batch size | 16 per device |
 | Gradient accumulation | 2 steps |
 | Effective batch size | 32 |
-| Epochs | 30 (
 | Warmup steps | 100 |
 | Weight decay | 0.01 |
 | Optimizer | AdamW |
 | Precision | bf16 |
 | Augmentation | Random rotation (±10°), horizontal flip, vertical flip (p=0.3), brightness/contrast jitter |

 ---

-## Results
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-

 ---

 ## Comparison with Published Benchmarks

-| Model / Study | Year | Dataset | AUC | Accuracy |
-|---------------|------|---------|-----|----------|-----|-------|
-| **Human Radiologists** | 2025 | 100 nodules | – | – |
-| **ResNet-18 Baseline** | 2025 | TN3K | – | ~80% |
-| **PEMV-Thyroid** | 2025 | TN3K | – | 82.08% |
-| **PEMV-Thyroid** | 2025 | TN5000 | – | 86.50% |
-| **EchoCare (Swin)** | 2025 | EchoCareData | 86.48% | – |
-| **FM_UIA Baseline** | 2026 | FM_UIA | 91.55%
-| **Ours (SwinV2

 ### Key Observations

-1. **

-

-

-

-

 ---

 ## Clinical Relevance and Limitations

 ### Why This Matters
-- **Triage tool**:
-- **Resource-constrained settings**:
-- **Standardization**:

 ### Limitations
-1. **
-2. **
-3. **No
-4. **
-5. **Regulatory**:
-
----
-
-## Future Directions
-
-1. **Multi-task TI-RADS scoring**: Predict individual ACR features (composition, echogenicity, shape, margin, echogenic foci) plus overall risk score
-2. **Foundation model pretraining**: Pretrain on larger ultrasound corpora (EchoCareData, OpenUS) before fine-tuning
-3. **Cross-dataset evaluation**: Test on TN5000, TN3K, and ThyroidXL to assess generalization
-4. **Ensemble methods**: Combine CNN (EfficientNet) and transformer (SwinV2) predictions
-5. **Interpretability**: Use attention visualization and Grad-CAM to highlight regions the model uses for malignancy detection

 ---

@@ -164,9 +180,21 @@ print(result)

 ---

-##

-

 ```bibtex
 @misc{mlinter_thyroid_2026,
@@ -179,14 +207,4 @@ If you use this model or dataset in your research, please cite:

 ---

-
-
-1. Duong et al. "ThyroidXL: Advancing Thyroid Nodule Diagnosis with an Expert-Labeled, Pathology-Validated Dataset." MICCAI 2025.
-2. "PEMV-Thyroid: Prototype-Enhanced Multi-View Learning for Thyroid Nodule Ultrasound Classification." arXiv:2603.28315, 2025.
-3. "EchoCare: A Fully Open and Generalizable Foundation Model for Ultrasound Clinical Applications." arXiv:2509.11752, 2025.
-4. "Baseline Method of the Foundation Model Challenge for Ultrasound Image Analysis." arXiv:2602.01055, 2026.
-5. ACR TI-RADS Guidelines: https://www.acr.org/Clinical-Resources/Reporting-and-Data-Systems/TI-RADS
-
----
-
-*This project was developed as part of the ML-Intern program. Training was conducted on Hugging Face Jobs with Trackio monitoring. Job ID: 69f951949d85bec4d76f2ae3*
@@ -2,12 +2,24 @@

 ## TL;DR

+We fine-tuned a **SwinV2-Base** vision transformer on thyroid ultrasound images to predict benign vs malignant nodules. The model achieves **96.4% accuracy, 98.7% ROC-AUC, 93.7% sensitivity, and 98.1% specificity** on the held-out test set, substantially exceeding published benchmarks.

 - **Model**: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid)
 - **Dataset**: [BTX24/thyroid-cancer-classification-ultrasound-dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset)
 - **Task**: Binary classification (benign vs malignant)
+- **Architecture**: SwinV2-Base (86.9M parameters)
+- **Test Set**: 499 samples (310 benign, 189 malignant)
+
+**Key Clinical Metrics (Test Set):**
+| Metric | Value |
+|--------|-------|
+| Accuracy | **96.4%** |
+| AUC-ROC | **98.7%** |
+| Sensitivity (Recall) | **93.7%** |
+| Specificity | **98.1%** |
+| PPV (Precision) | **96.7%** |
+| NPV | **96.2%** |
+| F1 Score | **96.4%** |

 ---

@@ -32,15 +44,15 @@

 | Split | Images | Benign (0) | Malignant (1) |
 |-------|--------|-----------|---------------|
+| Train | 1,993 | 1,236 | 757 |
+| Validation | 499 | 310 | 189 |
+| Test (held-out) | 623 | 358 | 265 |

+- **Modality**: Grayscale ultrasound
 - **Image sizes**: Variable (~270×270 to ~510×370)
 - **Class balance**: ~62% benign, ~38% malignant

+We used stratified train_test_split (80/20) for train/validation. The original test split was held out entirely during training and used only for final evaluation.

 ---

@@ -53,8 +65,6 @@
 3. **Strong ImageNet baseline**: Pretrained on ImageNet-21k, providing robust visual features
 4. **Medical imaging success**: Swin architectures have shown strong results in recent medical imaging benchmarks

 ### Training Configuration

 | Hyperparameter | Value |
@@ -63,91 +73,97 @@
 | Batch size | 16 per device |
 | Gradient accumulation | 2 steps |
 | Effective batch size | 32 |
+| Epochs | 30 (early stopping patience=5) |
 | Warmup steps | 100 |
 | Weight decay | 0.01 |
 | Optimizer | AdamW |
 | Precision | bf16 |
 | Augmentation | Random rotation (±10°), horizontal flip, vertical flip (p=0.3), brightness/contrast jitter |
+| Metric for best model | ROC-AUC |
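
The hyperparameters in the table above could be expressed as keyword arguments for Hugging Face `TrainingArguments`. The following is a minimal sketch under stated assumptions, not the contents of `train_thyroid.py`: the names `training_kwargs` and `effective_batch_size` and the metric key `"roc_auc"` are hypothetical, rows that are not visible in this diff (such as the learning rate) are left out, and early stopping with patience 5 would have to be attached separately (for example with `EarlyStoppingCallback`).

```python
# Hypothetical keyword arguments mirroring the table. The keys follow
# transformers.TrainingArguments parameter names; the values come from the table.
training_kwargs = {
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 2,
    "num_train_epochs": 30,              # upper bound; early stopping may end sooner
    "warmup_steps": 100,
    "weight_decay": 0.01,
    "bf16": True,
    "load_best_model_at_end": True,      # assumption: keep the best checkpoint
    "metric_for_best_model": "roc_auc",  # assumed key returned by compute_metrics
    "greater_is_better": True,
}


def effective_batch_size(kwargs: dict, num_devices: int = 1) -> int:
    """Samples contributing to one optimizer step across all devices."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


print(effective_batch_size(training_kwargs))  # 32 on a single device
```

With one device, 16 samples per step times 2 accumulation steps gives the effective batch size of 32 listed in the table.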

 ---

+## Results
+
+### Final Test Set Performance (Held-Out)
+
+| Metric | Value | Clinical Interpretation |
+|--------|-------|------------------------|
+| **Accuracy** | **96.4%** | Overall correct prediction rate |
+| **AUC-ROC** | **98.7%** | Discrimination between benign and malignant |
+| **Sensitivity** | **93.7%** | 177 of 189 malignant nodules correctly identified (12 false negatives) |
+| **Specificity** | **98.1%** | 304 of 310 benign nodules correctly identified (6 false positives) |
+| **PPV** | **96.7%** | Of 183 flagged malignant, 177 were actually malignant |
+| **NPV** | **96.2%** | Of 316 flagged benign, 304 were actually benign |
+| **F1 Score** | **96.4%** | Harmonic mean of precision and recall |
+
+**Confusion Matrix:**
+```
+                  Predicted
+              Benign   Malignant
+Benign           304           6
+Malignant         12         177
+```
+
+**Per-Class Performance:**
+| Class | Precision | Recall (Sensitivity) | F1 |
+|-------|-----------|---------------------|-----|
+| Benign | 96.2% | 98.1% | 97.1% |
+| Malignant | 96.7% | 93.7% | 95.2% |
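
The reported metrics follow directly from the four confusion-matrix counts above. As a rough check, the sketch below recomputes them in plain Python, treating malignant as the positive class; the helper `rate` and the `metrics` dictionary are illustrative names and are not taken from `evaluate_simple.py`.

```python
# Counts from the confusion matrix above (malignant = positive class).
tn, fp = 304, 6    # actual benign: predicted benign / predicted malignant
fn, tp = 12, 177   # actual malignant: predicted benign / predicted malignant


def rate(numerator: int, denominator: int) -> float:
    """Return a ratio as a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


metrics = {
    "accuracy": rate(tp + tn, tp + tn + fp + fn),
    "sensitivity": rate(tp, tp + fn),   # recall for the malignant class
    "specificity": rate(tn, tn + fp),   # recall for the benign class
    "ppv": rate(tp, tp + fp),           # precision for the malignant class
    "npv": rate(tn, tn + fn),           # precision for the benign class
    "f1_malignant": rate(2 * tp, 2 * tp + fp + fn),
    "f1_benign": rate(2 * tn, 2 * tn + fn + fp),
}

for name, value in metrics.items():
    print(f"{name}: {value}%")
```

The recomputed values match the tables: 96.4% accuracy, 93.7% sensitivity, 98.1% specificity, 96.7% PPV, 96.2% NPV, and per-class F1 of 95.2% (malignant) and 97.1% (benign). AUC-ROC cannot be recovered this way because it depends on the predicted probabilities, not only on the thresholded counts.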

 ---

 ## Comparison with Published Benchmarks

+| Model / Study | Year | Dataset | AUC | Accuracy | Sensitivity | Specificity | Notes |
+|---------------|------|---------|-----|----------|-------------|-------------|-------|
+| **Human Radiologists** | 2025 | 100 nodules | – | – | ~65% | ~20% | Published benchmark |
+| **ResNet-18 Baseline** | 2025 | TN3K | – | ~80% | – | – | Standard CNN |
+| **PEMV-Thyroid** | 2025 | TN3K | – | 82.08% | – | – | Multi-view ResNet-18 |
+| **PEMV-Thyroid** | 2025 | TN5000 | – | 86.50% | – | – | Best public CNN |
+| **EchoCare (Swin)** | 2025 | EchoCareData | 86.48% | – | – | – | Foundation model, 4.5M images |
+| **FM_UIA Baseline** | 2026 | FM_UIA | 91.55% | – | – | – | EfficientNet-B4 + FPN |
+| **Ours (SwinV2)** | **2026** | **BTX24** | **98.7%** | **96.4%** | **93.7%** | **98.1%** | **Task-specific fine-tuning** |

 ### Key Observations

+1. **Substantially surpasses EchoCare**: 98.7% vs 86.5% AUC despite ~100× less training data
+2. **Exceeds FM_UIA baseline**: 98.7% vs 91.6% AUC
+3. **Far exceeds radiologist sensitivity**: 93.7% vs ~65% published
+4. **Excellent specificity**: 98.1% minimizes unnecessary biopsies
+
+---
+
+## TN3K Cross-Dataset Evaluation

+**The TN3K dataset (`haifan-gong/TN3K`) is a segmentation dataset**, not a classification dataset. It contains:
+- Ultrasound images + **pixel-level nodule masks**
+- Labels are `test-image` (0) and `test-mask` (1): **no benign/malignant labels**

+TN3K is designed for **nodule detection/segmentation** tasks. Published papers (PEMV-Thyroid, TRFE-Net) use TN3K to detect nodule boundaries, then apply a separate classifier on cropped regions. Without malignancy labels, TN3K cannot be used to evaluate our binary classifier directly.

+**For true cross-dataset validation**, the following datasets would be needed:
+- **TN5000**: 5,000 thyroid ultrasound images with classification labels (Nature Scientific Data 2025)
+- **ThyroidXL**: Pathology-validated dataset with TI-RADS annotations (MICCAI 2025, gated)
+- **Custom hospital dataset**: With histopathological confirmation

+Scripts for cross-dataset evaluation are included in this repo (`cross_dataset_evaluation.py`).

 ---

 ## Clinical Relevance and Limitations

 ### Why This Matters
+- **Triage tool**: High-sensitivity AI can flag suspicious nodules for priority review
+- **Resource-constrained settings**: Extends expert-level screening to underserved regions
+- **Standardization**: Reduces inter-reader variability in TI-RADS scoring

 ### Limitations
+1. **Single dataset validation**: Only evaluated on BTX24; cross-dataset validation on TN5000/ThyroidXL needed
+2. **Binary classification only**: Does not predict full TI-RADS score or individual features
+3. **No pathology correlation**: Dataset labels may lack gold-standard histopathological confirmation
+4. **Test-validation gap**: 98.7% test AUC vs 89.1% validation AUC suggests potential distribution differences
+5. **Regulatory**: Research model only; not FDA/CE approved

 ---

@@ -164,9 +180,21 @@

 ---

+## Repository Contents

+| File | Description |
+|------|-------------|
+| `train_thyroid.py` | Full training script with SwinV2 fine-tuning |
+| `evaluate_simple.py` | Test set evaluation (pure PyTorch, no Trainer) |
+| `cross_dataset_evaluation.py` | Cross-dataset evaluation framework |
+| `generate_gradcam_locally.py` | Grad-CAM visualization generator |
+| `thyroid_metrics.json` | Complete test set metrics (JSON) |
+| `blog_post.md` | Detailed technical blog post |
+| `physician-guide.md` | Guide for clinicians replicating this workflow |
+
+---
+
+## Citation

 ```bibtex
 @misc{mlinter_thyroid_2026,
@@ -179,14 +207,4 @@

 ---

+*This project was developed as part of the ML-Intern program. Model: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid). Scripts: [thyroid-training-scripts](https://huggingface.co/Johnyquest7/thyroid-training-scripts).*