NHS Medical Letter Classifier

Fine-tuned DistilBERT (distilbert-base-uncased) for classifying OCR'd NHS medical clinic letters into 49 letter-type categories.

Model Details

Parameter             Value
--------------------  ----------------------------------------
Base model            distilbert-base-uncased
Training samples      13,672
Classes               49
Epochs                6
Batch size            16
Learning rate         2e-5
Max sequence length   512 tokens
Cleanlab corrections  212 labels relabeled (1.6% of dataset)

How We Got Here: Experiment Journey

1. Baseline: TF-IDF + LinearSVC

  • Approach: TfidfVectorizer (unigram+bigram, 50k features) with CalibratedClassifierCV(LinearSVC)
  • Result: ~91% accuracy on the original label set
  • Takeaway: Strong baseline, but limited by bag-of-words representation
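
A baseline along these lines can be sketched in scikit-learn. The vectorizer settings follow the description above (unigram+bigram, 50k features); the corpus and labels below are toy stand-ins, not real letters:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

# Unigram+bigram TF-IDF capped at 50k features, with a calibrated
# linear SVM so predict_proba is available for top-k reporting.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=50_000),
    CalibratedClassifierCV(LinearSVC(), cv=2),
)

# Tiny illustrative corpus (invented examples, two classes only).
texts = [
    "echocardiogram shows normal left ventricular function",
    "ECG and echo requested, cardiology follow-up arranged",
    "murmur noted, referred to the cardiology clinic",
    "chest pain reviewed, cardiology outpatient letter",
    "itchy rash on both arms, topical steroid prescribed",
    "pigmented lesion reviewed in the dermatology clinic",
    "eczema flare, dermatology review in six weeks",
    "psoriasis plaques improving, dermatology discharge",
]
labels = ["Cardiology"] * 4 + ["Dermatology"] * 4

clf.fit(texts, labels)
print(clf.predict(["patient seen in cardiology clinic today"])[0])
```

Wrapping LinearSVC in CalibratedClassifierCV is what makes probability-ranked top-3/top-5 outputs possible, since a bare LinearSVC only exposes decision margins.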

2. Label Merging (Critical Improvement)

  • Approach: Consolidated synonymous labels (e.g., "Nephrology" to "Renal", "Minor Illness Consultation" to "Pharmacy") and dropped ambiguous/administrative labels
  • Result: Accuracy jumped from ~91% to ~96%
  • Takeaway: Label quality matters more than model architecture. Reduced label set from ~51 to 49 meaningful categories
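
The consolidation step reduces to a lookup table plus a drop set. A minimal sketch, using the two merges named above (the "Admin/Other" drop label is hypothetical, for illustration only):

```python
# Illustrative label consolidation: map synonymous labels to a
# canonical name and drop ambiguous/administrative ones entirely.
MERGE = {
    "Nephrology": "Renal",
    "Minor Illness Consultation": "Pharmacy",
}
DROP = {"Admin/Other"}  # hypothetical ambiguous label, not from the real set

def consolidate(samples):
    """samples: list of (text, label) pairs -> cleaned list."""
    out = []
    for text, label in samples:
        label = MERGE.get(label, label)  # merge synonyms
        if label not in DROP:            # drop ambiguous labels
            out.append((text, label))
    return out

data = [("renal function letter", "Nephrology"),
        ("pharmacy first consult", "Minor Illness Consultation"),
        ("misc admin note", "Admin/Other")]
print(consolidate(data))
# -> [('renal function letter', 'Renal'), ('pharmacy first consult', 'Pharmacy')]
```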

3. DistilBERT Baseline (Our Core Model)

  • Approach: Fine-tuned distilbert-base-uncased, 4 epochs, 512 tokens, 70/10/20 stratified split
  • Result: Top-1: 95.76% | Top-3: 98.06% | Top-5: 98.61%
  • Takeaway: Strong performance, established as the baseline for all further experiments
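
The 70/10/20 stratified split can be reproduced with two chained calls to scikit-learn's train_test_split (a sketch on synthetic labels; the random seed is arbitrary):

```python
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 documents over 4 classes.
texts = [f"letter {i}" for i in range(100)]
labels = [i % 4 for i in range(100)]

# First peel off 20% as test, then split the remaining 80% so that
# validation is 10% overall (10/80 = 0.125), stratifying both times.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    texts, labels, test_size=0.20, stratify=labels, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.125, stratify=y_tmp, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 70 10 20
```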

4. ClinicalBERT & BioClinicalBERT

  • Approach: Tested domain-specific models (medicalai/ClinicalBERT, emilyalsentzer/Bio_ClinicalBERT)
  • Result: Similar to DistilBERT (~95-96%), no meaningful improvement
  • Takeaway: General-purpose DistilBERT captures enough for this task; domain pre-training didn't help

5. Longformer (1024 tokens)

  • Approach: allenai/longformer-base-4096 at 1024 tokens with global attention on CLS, case-sensitive
  • Result: Comparable to DistilBERT at 512 tokens
  • Takeaway: Most discriminative information is in the first 512 tokens; longer context doesn't help
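
The "global attention on CLS" setup is just an extra mask tensor with a 1 on the first position. A sketch of the tensors involved (no model download; the random input_ids are placeholders):

```python
import torch

# Longformer takes a global_attention_mask alongside input_ids:
# 0 = local sliding-window attention, 1 = global attention.
input_ids = torch.randint(0, 30_000, (2, 1024))  # batch of 2, 1024 tokens
global_attention_mask = torch.zeros_like(input_ids)
global_attention_mask[:, 0] = 1  # global attention on the CLS token only

# Passed to the model as, e.g.:
# outputs = model(input_ids, global_attention_mask=global_attention_mask)
```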

6. Hierarchical Architecture

  • Approach: Two-stage: DistilBERT body for CLS embeddings, per-clinic LogisticRegression heads. 51 fine labels mapped to 25 broad categories
  • Result: Did not outperform flat DistilBERT
  • Takeaway: The flat classification space works well; hierarchical routing adds complexity without benefit
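
The two-stage routing can be sketched with random clusters standing in for DistilBERT CLS embeddings (everything below is synthetic; the real system used 25 broad categories, not 2):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for CLS embeddings: 2 broad categories, each holding
# 2 fine labels, as well-separated random clusters in 8 dims.
def cluster(center, n):  # hypothetical helper for toy data
    return rng.normal(center, 0.1, size=(n, 8))

X = np.vstack([cluster(0.0, 20), cluster(1.0, 20),
               cluster(5.0, 20), cluster(6.0, 20)])
broad = np.array([0] * 40 + [1] * 40)
fine = np.array([0] * 20 + [1] * 20 + [2] * 20 + [3] * 20)

# Stage 1: route to a broad category.
router = LogisticRegression(max_iter=1000).fit(X, broad)

# Stage 2: one fine-grained LogisticRegression head per broad category.
heads = {b: LogisticRegression(max_iter=1000).fit(X[broad == b], fine[broad == b])
         for b in (0, 1)}

def predict(x):
    b = router.predict(x)[0]          # pick the broad category...
    return heads[b].predict(x)[0]     # ...then its dedicated fine head

print(predict(X[:1]))
```

A routing mistake in stage 1 is unrecoverable in stage 2, which is one reason this architecture failed to beat the flat classifier.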

7. LLM Relabeling (GPT-5-mini)

  • Approach: Used OpenAI Batch API to get GPT-5-mini to reclassify all 13,672 samples. Trained DistilBERT on LLM-assigned labels
  • Result: 86.22% vs original labels | 93.24% vs LLM labels (Top-1)
  • Takeaway: LLM agrees with original labels ~85.7% of the time. LLM labels are different but not better — the original clinical labels carry domain knowledge the LLM lacks
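
An OpenAI Batch API job is a JSONL file with one request object per sample. A sketch of building one line (the model name follows the description above; the prompt wording, label subset, and truncation length are illustrative assumptions):

```python
import json

LABELS = ["Cardiology", "Dermatology", "Renal"]  # truncated for the sketch

def batch_line(sample_id: str, letter_text: str) -> str:
    """One JSONL line for the OpenAI Batch API (/v1/chat/completions)."""
    return json.dumps({
        "custom_id": sample_id,          # echoed back so results can be joined
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-5-mini",
            "messages": [
                {"role": "system",
                 "content": "Classify this NHS clinic letter into exactly "
                            "one of: " + ", ".join(LABELS)},
                {"role": "user", "content": letter_text[:4000]},
            ],
        },
    })

line = batch_line("sample-0001", "Dear Dr Smith, echocardiogram results ...")
print(line[:60])
```

One such line per sample is written to a file, uploaded, and submitted as a batch; the `custom_id` field is what lets each returned label be matched back to its training row.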

8. Consensus Relabeling

  • Approach: Only change labels where both BERT and GPT-5-mini agree the original label is wrong
  • Result: Only 4 out of 9,569 samples met the consensus criteria
  • Takeaway: BERT memorizes its training labels, so it almost never disagrees with originals on training data. Consensus is too strict
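
The consensus rule reduces to a three-way comparison per sample, which makes its strictness easy to see:

```python
def consensus_relabel(original, bert_pred, llm_pred):
    """Keep the original label unless BERT and the LLM agree with each
    other AND both disagree with the original."""
    if bert_pred == llm_pred and bert_pred != original:
        return bert_pred
    return original

# Both models disagree with the original and agree with each other -> relabel.
assert consensus_relabel("Renal", "Urology", "Urology") == "Urology"
# Models disagree with each other -> keep the original.
assert consensus_relabel("Renal", "Urology", "Cardiology") == "Renal"
# BERT sides with the original (the usual memorization case) -> keep.
assert consensus_relabel("Renal", "Renal", "Urology") == "Renal"
```

Because BERT reproduces its own training labels almost perfectly, the third branch dominates on training data, which is why only 4 of 9,569 samples ever triggered a relabel.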

9. Soft Knowledge Distillation

  • Approach: Got GPT-5-mini top-5 predictions with confidence scores as soft labels. Trained with blended loss: alpha * CE(hard) + (1-alpha) * KL(soft || student), alpha=0.5
  • Result: Top-1: 95.32% (-0.44pp) | Top-3: 97.48% (-0.58pp)
  • Takeaway: LLM self-reported confidence scores are too noisy/uniform. Soft KL loss stayed flat at ~3.5. Would need actual logprobs for this to work
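
The blended loss can be sketched in PyTorch. Note the direction of F.kl_div: it expects the student's log-probabilities as input and the teacher distribution as target, which matches the KL(soft || student) form above (batch shapes and alpha follow the description; the random tensors are placeholders):

```python
import torch
import torch.nn.functional as F

def blended_loss(student_logits, hard_labels, teacher_probs, alpha=0.5):
    """alpha * CE(hard) + (1 - alpha) * KL(teacher || student)."""
    ce = F.cross_entropy(student_logits, hard_labels)
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  teacher_probs, reduction="batchmean")
    return alpha * ce + (1 - alpha) * kl

# Toy batch: 4 samples, 49 classes; teacher probs from random logits
# stand in for the LLM's top-5 confidence distribution.
logits = torch.randn(4, 49)
labels = torch.randint(0, 49, (4,))
teacher = torch.softmax(torch.randn(4, 49), dim=-1)
loss = blended_loss(logits, labels, teacher)
print(loss.item())
```

If the teacher distribution is nearly uniform, as the self-reported LLM confidences were, the KL term carries almost no gradient signal, consistent with the flat ~3.5 soft loss observed.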

10. Cleanlab: Remove Mislabeled Samples

  • Approach: Confident learning (Northcutt et al. 2021). 3-fold cross-validation for out-of-sample probabilities, then find_label_issues() to detect mislabeled samples. Removed 142 flagged training samples and retrained
  • Result: Top-1: 95.90% (+0.14pp) | Top-3: 97.70% (-0.36pp)
  • Takeaway: Small top-1 gain, but removing ambiguous samples hurt ranked predictions. Manual inspection confirmed ~99% of flagged samples were genuinely mislabeled
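
In the real pipeline this step used cleanlab's find_label_issues; a self-contained, simplified binary approximation of the same confident-learning idea (out-of-sample probabilities from cross-validation, per-class self-confidence thresholds) on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic 2-class data with 5 deliberately flipped labels.
X = np.vstack([rng.normal(0, 0.5, (50, 4)), rng.normal(3, 0.5, (50, 4))])
y_true = np.array([0] * 50 + [1] * 50)
flipped = [3, 17, 60, 72, 99]
y_noisy = y_true.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

# Out-of-sample probabilities via 3-fold CV (the same idea cleanlab uses).
probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y_noisy,
                          cv=3, method="predict_proba")

# Per-class self-confidence thresholds: mean confidence in class j over
# samples already labeled j. Flag a sample when its probability of the
# OTHER class beats both that threshold and the given-label probability.
thresholds = np.array([probs[y_noisy == j, j].mean() for j in (0, 1)])
given = probs[np.arange(len(y_noisy)), y_noisy]
other = probs[np.arange(len(y_noisy)), 1 - y_noisy]
issues = np.where((other >= thresholds[1 - y_noisy]) & (other > given))[0]
print(sorted(issues.tolist()))
```

On clean, well-separated data like this the flagged set recovers the flipped labels; on real letters the flagged set also sweeps up genuinely ambiguous samples, which is why removing them hurt top-3/top-5.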

11. Cleanlab: Relabel Instead of Remove

  • Approach: Same cleanlab detection, but replaced wrong labels with model's predicted label instead of removing samples
  • Result (vs original test labels): Top-1: 95.80% | Top-3: 97.92% | Top-5: 98.46%
  • Result (vs corrected test labels): Top-1: 98.06% | Top-3: 99.09% | Top-5: 99.38%
  • Takeaway: The ~2pp gap between original and corrected evaluation reveals that the remaining "errors" are mostly test set noise, not model mistakes. True model performance is ~98% top-1
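
Relabeling instead of removing is a one-line change over the flagged indices. A numpy sketch, assuming pred_probs come from the same cross-validated detection step (all values below are illustrative):

```python
import numpy as np

# Given: noisy labels, out-of-sample predicted probabilities, and the
# indices flagged as label issues (e.g. by cleanlab's find_label_issues).
labels = np.array([0, 2, 1, 1, 0])
pred_probs = np.array([
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],   # labeled 2 but model says 0 -> flagged
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],   # labeled 1 but model says 2 -> flagged
    [0.7, 0.2, 0.1],
])
flagged = np.array([1, 3])

corrected = labels.copy()
corrected[flagged] = pred_probs[flagged].argmax(axis=1)  # relabel, don't remove
print(corrected.tolist())  # [0, 0, 1, 2, 0]
```

Keeping the samples preserves the training signal for the ambiguous regions that step 10 threw away, which is why this variant recovered the top-3/top-5 scores.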

12. Production Model (This Model)

  • Approach: Fresh 3-fold cleanlab on the entire dataset (13,672 samples). Found 212 mislabeled samples (1.6%), relabeled all. Trained on full corrected dataset for 6 epochs
  • Sanity check: 99.74% accuracy on the training data (expected, since the model has seen every sample)
  • Estimated true accuracy: ~98% top-1, ~99% top-3 based on corrected-label evaluation

Key Findings

  1. Label quality > model architecture. Label merging (+5pp) and cleanlab corrections (+2pp true accuracy) had more impact than any model change
  2. DistilBERT is sufficient. Domain-specific models (ClinicalBERT, BioClinicalBERT) and longer context (Longformer) didn't help
  3. ~1.6% of labels are wrong. Discharge summary (9.1%), Paediatrics (7.2%), and Physiotherapy (6.8%) are the noisiest classes
  4. The model is better than naive metrics suggest. When evaluated against corrected labels, top-1 jumps from ~96% to ~98%

Labels (49 classes)

  • A&E
  • Ambulance Notification
  • Audiology
  • Bowel Cancer Screening
  • Breast Clinic
  • Cancer Screening
  • Cardiology
  • Colposcopy
  • Dermatology
  • Diabetes & Endocrine
  • Diet Services
  • Discharge summary
  • ENT
  • Echocardiogram
  • Elderly Care
  • Gastroenterology
  • General Surgery
  • Genetics
  • Haematology
  • INR
  • Immunology
  • Mammogram
  • Maternity
  • Maxillofacial
  • Mental Health
  • Neurology
  • Neurosurgery
  • Obstetrics & Gynaecology
  • Oncology
  • Ophthalmology
  • Orthopaedics
  • Out of Hours
  • Paediatrics
  • Pain Management
  • Pharmacy
  • Physiotherapy
  • Plastic Surgery
  • Radiology
  • Renal
  • Respiratory
  • Retinal Screening
  • Rheumatology
  • Sexual Health
  • Speech and Language Therapy
  • Stroke Services
  • Urgent Care Centre
  • Urology
  • Vascular
  • Walk in Centre

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from huggingface_hub import hf_hub_download
import torch, json

model = AutoModelForSequenceClassification.from_pretrained("mansour94/kynoby-william-bert-classifier")
tokenizer = AutoTokenizer.from_pretrained("mansour94/kynoby-william-bert-classifier")
model.eval()

# Load the id -> label mapping shipped alongside the weights
label_map_path = hf_hub_download("mansour94/kynoby-william-bert-classifier", "label_map.json")
with open(label_map_path) as f:
    label_map = json.load(f)
id2label = {int(k): v for k, v in label_map["id2label"].items()}

text = "Dear Dr Smith, I am writing to inform you about the patient's ophthalmology appointment..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)

# Top-3 predictions with confidences
top3 = torch.topk(probs, 3)
for idx, conf in zip(top3.indices[0].tolist(), top3.values[0].tolist()):
    print(f"  {id2label[idx]}: {conf:.1%}")

Model size: 67M parameters (F32, safetensors)