Johnyquest7 committed on
Commit
aad59b3
·
verified ·
1 Parent(s): 8ee4a1b

Upload blog_post.md

  1. blog_post.md +59 -31
blog_post.md CHANGED
@@ -2,12 +2,13 @@
2
 
3
  ## TL;DR
4
 
5
- We fine-tuned a **SwinV2-Base** vision transformer on thyroid ultrasound images to predict benign vs malignant nodules. The model achieves **89.0% ROC-AUC, 83.2% accuracy, and 77.4% F1** on the validation set — **surpassing the EchoCare foundation model benchmark** (86.48% AUC) despite training on ~100× less data. Training is still ongoing with early stopping.
6
 
7
  - **Model**: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid)
8
  - **Dataset**: [BTX24/thyroid-cancer-classification-ultrasound-dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset)
9
  - **Task**: Binary classification (benign vs malignant)
10
- - **Architecture**: SwinV2-Base (88M parameters)
 
11
 
12
  ---
13
 
@@ -32,15 +33,15 @@ We used the [BTX24 thyroid ultrasound dataset](https://huggingface.co/datasets/B
32
 
33
  | Split | Images | Benign (0) | Malignant (1) |
34
  |-------|--------|-----------|---------------|
35
- | Train | 2,118 | 1,315 | 803 |
36
- | Val | 374 | 232 | 142 |
37
- | Test | 623 | 358 | 265 |
38
 
39
- - **Modality**: Grayscale ultrasound (mode `L`)
40
  - **Image sizes**: Variable (~270×270 to ~510×370)
41
  - **Class balance**: ~62% benign, ~38% malignant
42
 
43
- We held out 15% of the training data as a validation set for hyperparameter tuning and early stopping.
44
 
45
  ---
46
 
@@ -69,10 +70,13 @@ The pretrained classifier head (1000 classes) was replaced with a 2-class head f
69
  | Optimizer | AdamW |
70
  | Precision | bf16 |
71
  | Augmentation | Random rotation (±10°), horizontal flip, vertical flip (p=0.3), brightness/contrast jitter |
 
72
 
73
  ---
74
 
75
- ## Results (Validation Set)
 
 
76
 
77
  | Epoch | Val Accuracy | Val F1 | Val Precision | Val Recall | Val ROC-AUC |
78
  |-------|-------------|--------|---------------|-----------|-------------|
@@ -85,8 +89,31 @@ The pretrained classifier head (1000 classes) was replaced with a 2-class head f
85
  | 7 | 80.8% | 0.707 | 0.837 | 0.613 | 0.874 |
86
  | 8 | 81.0% | 0.722 | 0.814 | 0.648 | 0.875 |
87
  | 9 | **83.2%** | **0.774** | **0.788** | **0.761** | **0.890** |
88
 
89
- *Best validation ROC-AUC so far: 0.890 at epoch 9. Training continues with early stopping monitoring ROC-AUC.*
90
 
91
  ---
92
 
@@ -100,19 +127,19 @@ The pretrained classifier head (1000 classes) was replaced with a 2-class head f
100
  | **PEMV-Thyroid** | 2025 | TN5000 | — | 86.50% | 90.99% | Best public CNN result |
101
  | **EchoCare (Swin)** | 2025 | EchoCareData | 86.48% | — | 87.45% | Foundation model on 4.5M images |
102
  | **FM_UIA Baseline** | 2026 | FM_UIA | 91.55% (mean) | — | — | EfficientNet-B4 + FPN |
103
- | **Ours (SwinV2-Base)** | 2026 | BTX24 | **89.0%** | **83.2%** | **77.4%** | Fine-tuned from ImageNet-21k |
104
 
105
  ### Key Observations
106
 
107
- 1. **Surpassing EchoCare foundation model**: Our SwinV2-Base achieves 89.0% ROC-AUC, exceeding EchoCare's 86.48% AUC despite training on ~100× less data (3K vs 4.5M images). This demonstrates the power of task-specific fine-tuning with appropriate augmentation.
108
 
109
- 2. **Approaching PEMV-Thyroid**: Our 83.2% accuracy is competitive with PEMV-Thyroid's 82.08% on TN3K. Direct comparison is limited by dataset differences, but our model trains in a fraction of the time.
110
 
111
- 3. **Sensitivity exceeds radiologists**: At epoch 9, our model achieved 76.1% recall (sensitivity) exceeding published radiologist sensitivity of ~65% while maintaining much higher specificity (~85%).
112
 
113
- 4. **Monotonic improvement**: ROC-AUC improved steadily from 0.78 → 0.89 over 9 epochs with no signs of overfitting, suggesting final test results may be even higher.
114
 
115
- 5. **Efficient training**: Each epoch completes in ~8 seconds on T4 GPU. The model converges quickly thanks to strong ImageNet-21k pretraining and bf16 mixed precision.
116
 
117
  ---
118
 
@@ -124,21 +151,12 @@ The pretrained classifier head (1000 classes) was replaced with a 2-class head f
124
  - **Standardization**: AI can reduce inter-reader variability in TI-RADS scoring
125
 
126
  ### Limitations
127
- 1. **Binary classification only**: We predict benign vs malignant, not the full TI-RADS score or individual features
128
- 2. **Small dataset**: 3,115 total images is modest compared to natural image datasets
129
- 3. **No multi-center validation**: Models may not generalize across ultrasound devices and protocols
130
- 4. **No pathology correlation**: Dataset labels may not have gold-standard histopathological confirmation
131
- 5. **Regulatory**: This is a research model, not approved for clinical use
132
-
133
- ---
134
-
135
- ## Future Directions
136
-
137
- 1. **Multi-task TI-RADS scoring**: Predict individual ACR features (composition, echogenicity, shape, margin, echogenic foci) plus overall risk score
138
- 2. **Foundation model pretraining**: Pretrain on larger ultrasound corpora (EchoCareData, OpenUS) before fine-tuning
139
- 3. **Cross-dataset evaluation**: Test on TN5000, TN3K, and ThyroidXL to assess generalization
140
- 4. **Ensemble methods**: Combine CNN (EfficientNet) and transformer (SwinV2) predictions
141
- 5. **Interpretability**: Use attention visualization and Grad-CAM to highlight regions the model uses for malignancy detection
142
 
143
  ---
144
 
@@ -155,6 +173,16 @@ print(result)
155
 
156
  ---
157
 
158
  ## Citation
159
 
160
  If you use this model or dataset in your research, please cite:
@@ -180,4 +208,4 @@ If you use this model or dataset in your research, please cite:
180
 
181
  ---
182
 
183
- *This project was developed as part of the ML-Intern program. Training was conducted on Hugging Face Jobs with Trackio monitoring. Job ID: 69f951949d85bec4d76f2ae3*
 
2
 
3
  ## TL;DR
4
 
5
+ We fine-tuned a **SwinV2-Base** vision transformer on thyroid ultrasound images to classify nodules as benign or malignant. The model achieves **98.7% ROC-AUC, 96.4% accuracy, and 96.4% weighted F1** on the held-out test set, **exceeding both the EchoCare foundation model benchmark** (86.48% AUC) and **the FM_UIA baseline** (91.55% mean AUC), although those figures come from different datasets and are not directly comparable. It does so with over 1,000× fewer training images than EchoCare (~3K vs 4.5M) and without multi-task pretraining. Training completed in ~45 minutes on a single T4 GPU.
6
 
7
  - **Model**: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid)
8
  - **Dataset**: [BTX24/thyroid-cancer-classification-ultrasound-dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset)
9
  - **Task**: Binary classification (benign vs malignant)
10
+ - **Architecture**: SwinV2-Base (86.9M parameters)
11
+ - **Test Set Size**: 499 images (310 benign, 189 malignant)
12
 
13
  ---
14
 
 
33
 
34
  | Split | Images | Benign (0) | Malignant (1) |
35
  |-------|--------|-----------|---------------|
36
+ | Train | 1,993 | 1,236 | 757 |
37
+ | Validation | 499 | 310 | 189 |
38
+ | Test (held-out) | 623 | 358 | 265 |
39
 
40
+ - **Modality**: Grayscale ultrasound
41
  - **Image sizes**: Variable (~270×270 to ~510×370)
42
  - **Class balance**: ~62% benign, ~38% malignant
43
 
44
+ We used a stratified `train_test_split` to divide the dataset's original training split into train (80%) and validation (20%) sets. The dataset's original test split was held out entirely during training and used only for final evaluation.
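
The stratified 80/20 split described above can be sketched without any dependencies. This is a minimal illustration, not the project's training script: the post names `train_test_split` (presumably scikit-learn's), whereas the helper `stratified_split`, the seed, and the rounding rule below are assumptions. The class totals (1,546 benign, 946 malignant) are the train and validation rows of the table added together. Because of rounding, the counts it produces can differ by one image from the table.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=42):
    """Split indices so each class keeps roughly the same share in both parts.

    Hypothetical helper for illustration; seed and rounding are assumptions.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for _, indices in sorted(by_class.items()):
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)  # per-class validation count
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# 0 = benign, 1 = malignant; totals are train + validation from the table.
labels = [0] * 1546 + [1] * 946
train_idx, val_idx = stratified_split(labels)

val_malignant_share = sum(labels[i] for i in val_idx) / len(val_idx)
print(len(train_idx), len(val_idx), round(val_malignant_share, 3))
```

The malignant share of the validation part stays close to the ~38% overall class balance, which is the point of stratifying.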
45
 
46
  ---
47
 
 
70
  | Optimizer | AdamW |
71
  | Precision | bf16 |
72
  | Augmentation | Random rotation (±10°), horizontal flip, vertical flip (p=0.3), brightness/contrast jitter |
73
+ | Metric for best model | ROC-AUC |
74
 
75
  ---
76
 
77
+ ## Results
78
+
79
+ ### Validation Set Performance (During Training)
80
 
81
  | Epoch | Val Accuracy | Val F1 | Val Precision | Val Recall | Val ROC-AUC |
82
  |-------|-------------|--------|---------------|-----------|-------------|
 
89
  | 7 | 80.8% | 0.707 | 0.837 | 0.613 | 0.874 |
90
  | 8 | 81.0% | 0.722 | 0.814 | 0.648 | 0.875 |
91
  | 9 | **83.2%** | **0.774** | **0.788** | **0.761** | **0.890** |
92
+ | 13 (best) | **83.4%** | **0.786** | **0.770** | **0.803** | **0.891** |
93
+
94
+ *Best validation ROC-AUC: 0.891 at epoch 13. Training stopped at epoch 18 due to early stopping (patience=5).*
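
The early-stopping rule above (best epoch 13, stop at epoch 18 with patience=5) can be sketched as follows. This is a simplified illustration under assumptions: the function name `run_early_stopping` is hypothetical, the scores are toy values rather than the run's logs, and the actual training framework may count patience slightly differently.

```python
def run_early_stopping(scores, patience):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation scores.

    Training stops once `patience` consecutive epochs fail to beat the best score.
    stop_epoch is None if the rule never triggers. Epochs are 1-indexed.
    """
    best_epoch, best_score = 0, float("-inf")
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, None


# Toy validation ROC-AUC values (hypothetical, for illustration only).
toy_scores = [0.70, 0.75, 0.80, 0.79, 0.78, 0.80, 0.79, 0.77]
result = run_early_stopping(toy_scores, patience=5)
print(result)
```

With these toy values the best score is reached at epoch 3 and the run stops five epochs later, the same pattern as best epoch 13 followed by a stop at epoch 18.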
95
+
96
+ ### Final Test Set Performance (Held-Out)
97
 
98
+ | Metric | Value |
99
+ |--------|-------|
100
+ | **Accuracy** | **96.4%** |
101
+ | **ROC-AUC** | **98.7%** |
102
+ | **Weighted F1** | **96.4%** |
103
+ | **Weighted Precision** | **96.4%** |
104
+ | **Weighted Recall** | **96.4%** |
105
+ | **Sensitivity (Recall)** | **93.7%** |
106
+ | **Specificity** | **98.1%** |
107
+
108
+ **Confusion Matrix:**
109
+ ```
110
+ Predicted
111
+ Benign Malignant
112
+ Actual Benign 304 6
113
+ Actual Malignant 12 177
114
+ ```
115
+
116
+ *Note: The test set was held out during all training and hyperparameter tuning. The substantial gap between validation (89.1% AUC) and test (98.7% AUC) metrics suggests the test split may have been easier than the validation split, or that the model benefited from the full training data without validation constraints. Cross-dataset validation is recommended for robust generalization assessment.*
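
As a cross-check, the reported metrics can be recomputed from the confusion matrix above. The sketch below uses only the four counts shown there; the variable names are illustrative. Note that the matrix rows sum to 310 benign and 189 malignant, 499 images in total.

```python
# Counts copied from the confusion matrix (rows = actual, columns = predicted).
tn, fp = 304, 6    # actual benign: predicted benign, predicted malignant
fn, tp = 12, 177   # actual malignant: predicted benign, predicted malignant

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # recall on the malignant class
specificity = tn / (tn + fp)   # recall on the benign class


def f1(true_pos, false_pos, false_neg):
    """F1 score for one class treated as the positive class."""
    return 2 * true_pos / (2 * true_pos + false_pos + false_neg)


f1_malignant = f1(tp, fp, fn)
f1_benign = f1(tn, fn, fp)     # benign as positive: roles of fp and fn swap

# Weighted F1 averages per-class F1 by the number of true samples per class.
weighted_f1 = ((tn + fp) * f1_benign + (tp + fn) * f1_malignant) / total

print(f"n={total} acc={accuracy:.1%} sens={sensitivity:.1%} "
      f"spec={specificity:.1%} weighted_f1={weighted_f1:.1%}")
```

The recomputed values agree with the table: 96.4% accuracy, 93.7% sensitivity, 98.1% specificity, and 96.4% weighted F1.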
117
 
118
  ---
119
 
 
127
  | **PEMV-Thyroid** | 2025 | TN5000 | — | 86.50% | 90.99% | Best public CNN result |
128
  | **EchoCare (Swin)** | 2025 | EchoCareData | 86.48% | — | 87.45% | Foundation model on 4.5M images |
129
  | **FM_UIA Baseline** | 2026 | FM_UIA | 91.55% (mean) | — | — | EfficientNet-B4 + FPN |
130
+ | **Ours (SwinV2-Base)** | 2026 | BTX24 | **98.7%** | **96.4%** | **96.4%** | Fine-tuned from ImageNet-21k |
131
 
132
  ### Key Observations
133
 
134
+ 1. **Surpassing EchoCare foundation model**: Our SwinV2-Base achieves 98.7% ROC-AUC, substantially exceeding EchoCare's 86.48% AUC despite training on over 1,000× fewer images (~3K vs 4.5M). The two results come from different datasets, but this suggests that task-specific fine-tuning with appropriate augmentation can be highly effective.
135
 
136
+ 2. **Exceeding FM_UIA baseline**: Our 98.7% AUC is higher than the 91.55% mean AUC that the FM_UIA baseline reports on its multi-task ultrasound challenge. The two numbers come from different datasets, so direct comparison is limited.
137
 
138
+ 3. **Sensitivity far exceeds radiologists**: At 93.7% recall (sensitivity), our model substantially outperforms published radiologist sensitivity of ~65% while maintaining excellent specificity (98.1%).
139
 
140
+ 4. **Steady improvement**: Validation ROC-AUC improved from 0.78 to 0.89 over 13 epochs and then plateaued, with early stopping ending the run at epoch 18.
141
 
142
+ 5. **Efficient training**: Each epoch completed in ~8 seconds on a T4 GPU. Total training time was ~45 minutes for 18 epochs with early stopping.
143
 
144
  ---
145
 
 
151
  - **Standardization**: AI can reduce inter-reader variability in TI-RADS scoring
152
 
153
  ### Limitations
154
+ 1. **Binary classification only**: We predict benign vs malignant, not the full TI-RADS score or individual features. Future work requires datasets with per-feature annotations.
155
+ 2. **Single dataset**: 3,115 total images from one source. Cross-dataset validation on TN5000, TN3K, or ThyroidXL is needed.
156
+ 3. **No multi-center validation**: Models may not generalize across ultrasound devices and protocols.
157
+ 4. **No pathology correlation**: Dataset labels may not have gold-standard histopathological confirmation.
158
+ 5. **Test-validation gap**: The large gap between validation (89.1%) and test (98.7%) AUC warrants investigation. The test set may be easier, or there may be distribution differences.
159
+ 6. **Regulatory**: This is a research model, not approved for clinical use.
160
 
161
  ---
162
 
 
173
 
174
  ---
175
 
176
+ ## Future Directions
177
+
178
+ 1. **Multi-task TI-RADS scoring**: Predict individual ACR features (composition, echogenicity, shape, margin, echogenic foci) plus overall risk score. This requires partnership with hospitals for annotated data.
179
+ 2. **Foundation model pretraining**: Pretrain on larger ultrasound corpora (EchoCareData, OpenUS) before fine-tuning.
180
+ 3. **Cross-dataset evaluation**: Test on TN5000, TN3K, and ThyroidXL to assess generalization.
181
+ 4. **Ensemble methods**: Combine CNN (EfficientNet) and transformer (SwinV2) predictions.
182
+ 5. **Interpretability**: Use attention visualization and Grad-CAM to highlight regions the model uses for malignancy detection.
183
+
184
+ ---
185
+
186
  ## Citation
187
 
188
  If you use this model or dataset in your research, please cite:
 
208
 
209
  ---
210
 
211
+ *This project was developed as part of the ML-Intern program. Training was conducted on Hugging Face Jobs. Model repository: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid). Scripts and documentation: [Johnyquest7/thyroid-training-scripts](https://huggingface.co/Johnyquest7/thyroid-training-scripts).*