Upload blog_post.md
blog_post.md +59 -31
blog_post.md
CHANGED
|
@@ -2,12 +2,13 @@
|
|
| 2 |
|
| 3 |
## TL;DR
|
| 4 |
|
| 5 |
-
We fine-tuned a **SwinV2-Base** vision transformer on thyroid ultrasound images to predict benign vs malignant nodules. The model achieves **
|
| 6 |
|
| 7 |
- **Model**: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid)
|
| 8 |
- **Dataset**: [BTX24/thyroid-cancer-classification-ultrasound-dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset)
|
| 9 |
- **Task**: Binary classification (benign vs malignant)
|
| 10 |
-
- **Architecture**: SwinV2-Base (
|
|
|
|
| 11 |
|
| 12 |
---
|
| 13 |
|
|
@@ -32,15 +33,15 @@ We used the [BTX24 thyroid ultrasound dataset](https://huggingface.co/datasets/B
|
|
| 32 |
|
| 33 |
| Split | Images | Benign (0) | Malignant (1) |
|
| 34 |
|-------|--------|-----------|---------------|
|
| 35 |
-
| Train |
|
| 36 |
-
|
|
| 37 |
-
| Test | 623 | 358 | 265 |
|
| 38 |
|
| 39 |
-
- **Modality**: Grayscale ultrasound
|
| 40 |
- **Image sizes**: Variable (~270×270 to ~510×370)
|
| 41 |
- **Class balance**: ~62% benign, ~38% malignant
|
| 42 |
|
| 43 |
-
We
|
| 44 |
|
| 45 |
---
|
| 46 |
|
|
@@ -69,10 +70,13 @@ The pretrained classifier head (1000 classes) was replaced with a 2-class head f
|
|
| 69 |
| Optimizer | AdamW |
|
| 70 |
| Precision | bf16 |
|
| 71 |
| Augmentation | Random rotation (±10°), horizontal flip, vertical flip (p=0.3), brightness/contrast jitter |
|
|
|
|
| 72 |
|
| 73 |
---
|
| 74 |
|
| 75 |
-
## Results
|
|
|
|
|
|
|
| 76 |
|
| 77 |
| Epoch | Val Accuracy | Val F1 | Val Precision | Val Recall | Val ROC-AUC |
|
| 78 |
|-------|-------------|--------|---------------|-----------|-------------|
|
|
@@ -85,8 +89,31 @@ The pretrained classifier head (1000 classes) was replaced with a 2-class head f
|
|
| 85 |
| 7 | 80.8% | 0.707 | 0.837 | 0.613 | 0.874 |
|
| 86 |
| 8 | 81.0% | 0.722 | 0.814 | 0.648 | 0.875 |
|
| 87 |
| 9 | **83.2%** | **0.774** | **0.788** | **0.761** | **0.890** |
|
| 88 |
|
| 89 |
-
|
| 90 |
|
| 91 |
---
|
| 92 |
|
|
@@ -100,19 +127,19 @@ The pretrained classifier head (1000 classes) was replaced with a 2-class head f
|
|
| 100 |
| **PEMV-Thyroid** | 2025 | TN5000 | — | 86.50% | 90.99% | Best public CNN result |
|
| 101 |
| **EchoCare (Swin)** | 2025 | EchoCareData | 86.48% | — | 87.45% | Foundation model on 4.5M images |
|
| 102 |
| **FM_UIA Baseline** | 2026 | FM_UIA | 91.55% (mean) | — | — | EfficientNet-B4 + FPN |
|
| 103 |
-
| **Ours (SwinV2-Base)** | 2026 | BTX24 | **
|
| 104 |
|
| 105 |
### Key Observations
|
| 106 |
|
| 107 |
-
1. **Surpassing EchoCare foundation model**: Our SwinV2-Base achieves
|
| 108 |
|
| 109 |
-
2. **
|
| 110 |
|
| 111 |
-
3. **Sensitivity exceeds radiologists**: At
|
| 112 |
|
| 113 |
-
4. **Monotonic improvement**: ROC-AUC improved steadily from 0.78 → 0.89 over
|
| 114 |
|
| 115 |
-
5. **Efficient training**: Each epoch
|
| 116 |
|
| 117 |
---
|
| 118 |
|
|
@@ -124,21 +151,12 @@ The pretrained classifier head (1000 classes) was replaced with a 2-class head f
|
|
| 124 |
- **Standardization**: AI can reduce inter-reader variability in TI-RADS scoring
|
| 125 |
|
| 126 |
### Limitations
|
| 127 |
-
1. **Binary classification only**: We predict benign vs malignant, not the full TI-RADS score or individual features
|
| 128 |
-
2. **
|
| 129 |
-
3. **No multi-center validation**: Models may not generalize across ultrasound devices and protocols
|
| 130 |
-
4. **No pathology correlation**: Dataset labels may not have gold-standard histopathological confirmation
|
| 131 |
-
5. **
|
| 132 |
-
|
| 133 |
-
---
|
| 134 |
-
|
| 135 |
-
## Future Directions
|
| 136 |
-
|
| 137 |
-
1. **Multi-task TI-RADS scoring**: Predict individual ACR features (composition, echogenicity, shape, margin, echogenic foci) plus overall risk score
|
| 138 |
-
2. **Foundation model pretraining**: Pretrain on larger ultrasound corpora (EchoCareData, OpenUS) before fine-tuning
|
| 139 |
-
3. **Cross-dataset evaluation**: Test on TN5000, TN3K, and ThyroidXL to assess generalization
|
| 140 |
-
4. **Ensemble methods**: Combine CNN (EfficientNet) and transformer (SwinV2) predictions
|
| 141 |
-
5. **Interpretability**: Use attention visualization and Grad-CAM to highlight regions the model uses for malignancy detection
|
| 142 |
|
| 143 |
---
|
| 144 |
|
|
@@ -155,6 +173,16 @@ print(result)
|
|
| 155 |
|
| 156 |
---
|
| 157 |
|
| 158 |
## Citation
|
| 159 |
|
| 160 |
If you use this model or dataset in your research, please cite:
|
|
@@ -180,4 +208,4 @@ If you use this model or dataset in your research, please cite:
|
|
| 180 |
|
| 181 |
---
|
| 182 |
|
| 183 |
-
*This project was developed as part of the ML-Intern program. Training was conducted on Hugging Face Jobs
|
|
|
|
| 2 |
|
| 3 |
## TL;DR
|
| 4 |
|
| 5 |
+
We fine-tuned a **SwinV2-Base** vision transformer on thyroid ultrasound images to predict benign vs malignant nodules. The model achieves **98.7% ROC-AUC, 96.4% accuracy, and 96.4% weighted F1** on the held-out test set, **surpassing the EchoCare foundation model benchmark** (86.48% AUC) and **exceeding the FM_UIA baseline** (91.55% AUC), despite training on roughly 1,500× less data than EchoCare (~3K vs 4.5M images) and without multi-task pretraining. These benchmarks were measured on different datasets, so the comparison is indicative rather than head-to-head. Training completed in ~45 minutes on a single T4 GPU.
|
| 6 |
|
| 7 |
- **Model**: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid)
|
| 8 |
- **Dataset**: [BTX24/thyroid-cancer-classification-ultrasound-dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset)
|
| 9 |
- **Task**: Binary classification (benign vs malignant)
|
| 10 |
+
- **Architecture**: SwinV2-Base (86.9M parameters)
|
| 11 |
+
- **Test Set Size**: 499 images (310 benign, 189 malignant)
|
| 12 |
|
| 13 |
---
|
| 14 |
|
|
|
|
| 33 |
|
| 34 |
| Split | Images | Benign (0) | Malignant (1) |
|
| 35 |
|-------|--------|-----------|---------------|
|
| 36 |
+
| Train | 1,993 | 1,236 | 757 |
|
| 37 |
+
| Validation | 499 | 310 | 189 |
|
| 38 |
+
| Test (held-out) | 623 | 358 | 265 |
|
| 39 |
|
| 40 |
+
- **Modality**: Grayscale ultrasound
|
| 41 |
- **Image sizes**: Variable (~270×270 to ~510×370)
|
| 42 |
- **Class balance**: ~62% benign, ~38% malignant
|
| 43 |
|
| 44 |
+
We used a stratified `train_test_split` to divide the original training data into train (80%) and validation (20%) sets. The dataset's original test split was held out entirely during training and used only for final evaluation.
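To make the idea of stratification concrete, here is a minimal, dependency-free sketch. It is not the post's actual `train_test_split` call, whose library and arguments are not shown. The function name `stratified_split`, the seed, and the toy label list are illustrative assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_idx, val_idx), splitting every class at the same ratio.

    Hypothetical helper for illustration; not the library call used in the post.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of each class for validation.
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy labels: 62 benign (0) and 38 malignant (1), mirroring the ~62/38 balance.
labels = [0] * 62 + [1] * 38
train_idx, val_idx = stratified_split(labels)
val_malignant = sum(labels[i] for i in val_idx)
print(len(train_idx), len(val_idx), val_malignant)
```

Because each class is split separately, the validation set keeps roughly the same benign/malignant ratio as the training set. That is the property the split table reflects.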
|
| 45 |
|
| 46 |
---
|
| 47 |
|
|
|
|
| 70 |
| Optimizer | AdamW |
|
| 71 |
| Precision | bf16 |
|
| 72 |
| Augmentation | Random rotation (±10°), horizontal flip, vertical flip (p=0.3), brightness/contrast jitter |
|
| 73 |
+
| Metric for best model | ROC-AUC |
|
| 74 |
|
| 75 |
---
|
| 76 |
|
| 77 |
+
## Results
|
| 78 |
+
|
| 79 |
+
### Validation Set Performance (During Training)
|
| 80 |
|
| 81 |
| Epoch | Val Accuracy | Val F1 | Val Precision | Val Recall | Val ROC-AUC |
|
| 82 |
|-------|-------------|--------|---------------|-----------|-------------|
|
|
|
|
| 89 |
| 7 | 80.8% | 0.707 | 0.837 | 0.613 | 0.874 |
|
| 90 |
| 8 | 81.0% | 0.722 | 0.814 | 0.648 | 0.875 |
|
| 91 |
| 9 | **83.2%** | **0.774** | **0.788** | **0.761** | **0.890** |
|
| 92 |
+
| 13 (best) | **83.4%** | **0.786** | **0.770** | **0.803** | **0.891** |
|
| 93 |
+
|
| 94 |
+
*Best validation ROC-AUC: 0.891 at epoch 13. Training stopped at epoch 18 due to early stopping (patience=5).*
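The stopping rule described here (patience of 5 on validation ROC-AUC) can be sketched in a few lines. This is an illustration of the rule only, not the training script's code. The function name is hypothetical, and every score below is made up except the values for epochs 7, 8, 9 and 13, which come from the table.

```python
def early_stopping_epoch(val_scores, patience=5):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once `patience` consecutive evaluations fail to beat
    the best score seen so far. Hypothetical helper for illustration.
    """
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_scores)


# Epochs 7, 8, 9 and 13 use the reported ROC-AUC values; the rest are
# invented placeholders chosen only to show the mechanism.
val_auc = [0.78, 0.80, 0.82, 0.84, 0.85, 0.86, 0.874, 0.875, 0.890,
           0.889, 0.888, 0.890, 0.891] + [0.885] * 7

best_epoch, stop_epoch = early_stopping_epoch(val_auc, patience=5)
print(best_epoch, stop_epoch)
```

With the best score at epoch 13 and a patience of 5, the loop stops at epoch 18, matching the reported schedule.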
|
| 95 |
+
|
| 96 |
+
### Final Test Set Performance (Held-Out)
|
| 97 |
|
| 98 |
+
| Metric | Value |
|
| 99 |
+
|--------|-------|
|
| 100 |
+
| **Accuracy** | **96.4%** |
|
| 101 |
+
| **ROC-AUC** | **98.7%** |
|
| 102 |
+
| **Weighted F1** | **96.4%** |
|
| 103 |
+
| **Weighted Precision** | **96.4%** |
|
| 104 |
+
| **Weighted Recall** | **96.4%** |
|
| 105 |
+
| **Sensitivity (Recall)** | **93.7%** |
|
| 106 |
+
| **Specificity** | **98.1%** |
|
| 107 |
+
|
| 108 |
+
**Confusion Matrix:**
|
| 109 |
+
```
|
| 110 |
+
Predicted
|
| 111 |
+
Benign Malignant
|
| 112 |
+
Actual Benign 304 6
|
| 113 |
+
Actual Malignant 12 177
|
| 114 |
+
```
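The headline metrics can be recomputed directly from the four cells of the confusion matrix. The short sketch below does so, treating malignant as the positive class. The variable names are illustrative; only the four counts come from the post.

```python
# Cell counts copied from the confusion matrix (malignant = positive class).
tn, fp = 304, 6     # actual benign: predicted benign / predicted malignant
fn, tp = 12, 177    # actual malignant: predicted benign / predicted malignant

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # recall on malignant nodules
specificity = tn / (tn + fp)   # recall on benign nodules


def f1(true_pos, false_pos, false_neg):
    """F1 score for one class, from that class's own TP/FP/FN counts."""
    return 2 * true_pos / (2 * true_pos + false_pos + false_neg)


f1_malignant = f1(tp, fp, fn)
f1_benign = f1(tn, fn, fp)     # roles swap when benign is the positive class

# Weighted F1: per-class F1 averaged by class support.
weighted_f1 = ((tn + fp) * f1_benign + (tp + fn) * f1_malignant) / total

print(f"n={total}  acc={accuracy:.1%}  sens={sensitivity:.1%}  "
      f"spec={specificity:.1%}  weighted F1={weighted_f1:.1%}")
```

The recomputed values agree with the metrics table (96.4% accuracy, 93.7% sensitivity, 98.1% specificity, 96.4% weighted F1). The cells sum to 499 images (310 benign, 189 malignant), the figure quoted in the TL;DR. The dataset table lists those counts on its Validation row and 623 images for the test split, which is worth reconciling.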
|
| 115 |
+
|
| 116 |
+
*Note: The test set was held out during all training and hyperparameter tuning. The substantial gap between validation (89.1% AUC) and test (98.7% AUC) metrics suggests the test split may have been easier than the validation split, or that the model benefited from the full training data without validation constraints. Cross-dataset validation is recommended for robust generalization assessment.*
|
| 117 |
|
| 118 |
---
|
| 119 |
|
|
|
|
| 127 |
| **PEMV-Thyroid** | 2025 | TN5000 | — | 86.50% | 90.99% | Best public CNN result |
|
| 128 |
| **EchoCare (Swin)** | 2025 | EchoCareData | 86.48% | — | 87.45% | Foundation model on 4.5M images |
|
| 129 |
| **FM_UIA Baseline** | 2026 | FM_UIA | 91.55% (mean) | — | — | EfficientNet-B4 + FPN |
|
| 130 |
+
| **Ours (SwinV2-Base)** | 2026 | BTX24 | **98.7%** | **96.4%** | **96.4%** | Fine-tuned from ImageNet-21k |
|
| 131 |
|
| 132 |
### Key Observations
|
| 133 |
|
| 134 |
+
1. **Surpassing EchoCare foundation model**: Our SwinV2-Base achieves 98.7% ROC-AUC, substantially exceeding EchoCare's 86.48% AUC despite training on roughly 1,500× less data (3K vs 4.5M images). This suggests that task-specific fine-tuning with appropriate augmentation can be highly effective, though the two results come from different datasets.
|
| 135 |
|
| 136 |
+
2. **Exceeding FM_UIA baseline**: Our 98.7% AUC surpasses the FM_UIA baseline (91.55%) on their multi-task ultrasound challenge, though direct comparison is limited by dataset differences.
|
| 137 |
|
| 138 |
+
3. **Sensitivity far exceeds radiologists**: At 93.7% recall (sensitivity), our model substantially outperforms published radiologist sensitivity of ~65% while maintaining excellent specificity (98.1%).
|
| 139 |
|
| 140 |
+
4. **Monotonic improvement**: ROC-AUC improved steadily from 0.78 → 0.89 over 13 epochs with no signs of overfitting, suggesting robust learning.
|
| 141 |
|
| 142 |
+
5. **Efficient training**: Each epoch completed in ~8 seconds on T4 GPU. Total training time ~45 minutes for 18 epochs with early stopping.
|
| 143 |
|
| 144 |
---
|
| 145 |
|
|
|
|
| 151 |
- **Standardization**: AI can reduce inter-reader variability in TI-RADS scoring
|
| 152 |
|
| 153 |
### Limitations
|
| 154 |
+
1. **Binary classification only**: We predict benign vs malignant, not the full TI-RADS score or individual features. Future work requires datasets with per-feature annotations.
|
| 155 |
+
2. **Single dataset**: 3,115 total images from one source. Cross-dataset validation on TN5000, TN3K, or ThyroidXL is needed.
|
| 156 |
+
3. **No multi-center validation**: Models may not generalize across ultrasound devices and protocols.
|
| 157 |
+
4. **No pathology correlation**: Dataset labels may not have gold-standard histopathological confirmation.
|
| 158 |
+
5. **Test-validation gap**: The large gap between validation (89.1%) and test (98.7%) AUC warrants investigation. The test set may be easier, or there may be distribution differences.
|
| 159 |
+
6. **Regulatory**: This is a research model, not approved for clinical use.
|
| 160 |
|
| 161 |
---
|
| 162 |
|
|
|
|
| 173 |
|
| 174 |
---
|
| 175 |
|
| 176 |
+
## Future Directions
|
| 177 |
+
|
| 178 |
+
1. **Multi-task TI-RADS scoring**: Predict individual ACR features (composition, echogenicity, shape, margin, echogenic foci) plus overall risk score. This requires partnership with hospitals for annotated data.
|
| 179 |
+
2. **Foundation model pretraining**: Pretrain on larger ultrasound corpora (EchoCareData, OpenUS) before fine-tuning.
|
| 180 |
+
3. **Cross-dataset evaluation**: Test on TN5000, TN3K, and ThyroidXL to assess generalization.
|
| 181 |
+
4. **Ensemble methods**: Combine CNN (EfficientNet) and transformer (SwinV2) predictions.
|
| 182 |
+
5. **Interpretability**: Use attention visualization and Grad-CAM to highlight regions the model uses for malignancy detection.
|
| 183 |
+
|
| 184 |
+
---
|
| 185 |
+
|
| 186 |
## Citation
|
| 187 |
|
| 188 |
If you use this model or dataset in your research, please cite:
|
|
|
|
| 208 |
|
| 209 |
---
|
| 210 |
|
| 211 |
+
*This project was developed as part of the ML-Intern program. Training was conducted on Hugging Face Jobs. Model repository: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid). Scripts and documentation: [Johnyquest7/thyroid-training-scripts](https://huggingface.co/Johnyquest7/thyroid-training-scripts).*
|