NHS Medical Letter Classifier

Fine-tuned DistilBERT (distilbert-base-uncased) for classifying OCR'd NHS medical clinic letters into 49 letter-type categories.

Model Details

Parameter             Value
--------------------  ----------------------------------------
Base model            distilbert-base-uncased
Training samples      13,672
Classes               49
Epochs                6
Batch size            16
Learning rate         2e-5
Max sequence length   512 tokens
Cleanlab corrections  212 labels relabeled (1.6% of dataset)

How We Got Here: Experiment Journey

1. Baseline: TF-IDF + LinearSVC

  • Approach: TfidfVectorizer (unigram+bigram, 50k features) with CalibratedClassifierCV(LinearSVC)
  • Result: ~91% accuracy on the original label set
  • Takeaway: Strong baseline, but limited by bag-of-words representation
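
A baseline along these lines can be sketched in scikit-learn. The vectorizer settings follow the description above (unigram+bigram, 50k features); the corpus and labels below are toy stand-ins, not real letters:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

# Unigram+bigram TF-IDF capped at 50k features, with a calibrated
# linear SVM so predict_proba is available for top-k reporting.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=50_000),
    CalibratedClassifierCV(LinearSVC(), cv=2),
)

# Tiny illustrative corpus (invented examples, two classes only).
texts = [
    "echocardiogram shows normal left ventricular function",
    "ECG and echo requested, cardiology follow-up arranged",
    "murmur noted, referred to the cardiology clinic",
    "chest pain reviewed, cardiology outpatient letter",
    "itchy rash on both arms, topical steroid prescribed",
    "pigmented lesion reviewed in the dermatology clinic",
    "eczema flare, dermatology review in six weeks",
    "psoriasis plaques improving, dermatology discharge",
]
labels = ["Cardiology"] * 4 + ["Dermatology"] * 4

clf.fit(texts, labels)
print(clf.predict(["patient seen in cardiology clinic today"])[0])
```

Wrapping LinearSVC in CalibratedClassifierCV is what makes probability-ranked top-3/top-5 outputs possible, since a bare LinearSVC only exposes decision margins.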

2. Label Merging (Critical Improvement)

  • Approach: Consolidated synonymous labels (e.g., "Nephrology" to "Renal", "Minor Illness Consultation" to "Pharmacy") and dropped ambiguous/administrative labels
  • Result: Accuracy jumped from ~91% to ~96%
  • Takeaway: Label quality matters more than model architecture. Reduced label set from ~51 to 49 meaningful categories
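
The consolidation step reduces to a lookup table plus a drop set. A minimal sketch, using the two merges named above (the "Admin/Other" drop label is hypothetical, for illustration only):

```python
# Illustrative label consolidation: map synonymous labels to a
# canonical name and drop ambiguous/administrative ones entirely.
MERGE = {
    "Nephrology": "Renal",
    "Minor Illness Consultation": "Pharmacy",
}
DROP = {"Admin/Other"}  # hypothetical ambiguous label, not from the real set

def consolidate(samples):
    """samples: list of (text, label) pairs -> cleaned list."""
    out = []
    for text, label in samples:
        label = MERGE.get(label, label)  # merge synonyms
        if label not in DROP:            # drop ambiguous labels
            out.append((text, label))
    return out

data = [("renal function letter", "Nephrology"),
        ("pharmacy first consult", "Minor Illness Consultation"),
        ("misc admin note", "Admin/Other")]
print(consolidate(data))
# -> [('renal function letter', 'Renal'), ('pharmacy first consult', 'Pharmacy')]
```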

3. DistilBERT Baseline (Our Core Model)

  • Approach: Fine-tuned distilbert-base-uncased, 4 epochs, 512 tokens, 70/10/20 stratified split
  • Result: Top-1: 95.76% | Top-3: 98.06% | Top-5: 98.61%
  • Takeaway: Strong performance, established as the baseline for all further experiments
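
The 70/10/20 stratified split can be reproduced with two chained calls to scikit-learn's train_test_split (a sketch on synthetic labels; the random seed is arbitrary):

```python
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 documents over 4 classes.
texts = [f"letter {i}" for i in range(100)]
labels = [i % 4 for i in range(100)]

# First peel off 20% as test, then split the remaining 80% so that
# validation is 10% overall (10/80 = 0.125), stratifying both times.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    texts, labels, test_size=0.20, stratify=labels, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.125, stratify=y_tmp, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 70 10 20
```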

4. ClinicalBERT & BioClinicalBERT

  • Approach: Tested domain-specific models (medicalai/ClinicalBERT, emilyalsentzer/Bio_ClinicalBERT)
  • Result: Similar to DistilBERT (~95-96%), no meaningful improvement
  • Takeaway: General-purpose DistilBERT captures enough for this task; domain pre-training didn't help

5. Longformer (1024 tokens)

  • Approach: allenai/longformer-base-4096 at 1024 tokens with global attention on CLS, case-sensitive
  • Result: Comparable to DistilBERT at 512 tokens
  • Takeaway: Most discriminative information is in the first 512 tokens; longer context doesn't help
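
The "global attention on CLS" setup is just an extra mask tensor with a 1 on the first position. A sketch of the tensors involved (no model download; the random input_ids are placeholders):

```python
import torch

# Longformer takes a global_attention_mask alongside input_ids:
# 0 = local sliding-window attention, 1 = global attention.
input_ids = torch.randint(0, 30_000, (2, 1024))  # batch of 2, 1024 tokens
global_attention_mask = torch.zeros_like(input_ids)
global_attention_mask[:, 0] = 1  # global attention on the CLS token only

# Passed to the model as, e.g.:
# outputs = model(input_ids, global_attention_mask=global_attention_mask)
```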

6. Hierarchical Architecture

  • Approach: Two-stage: DistilBERT body for CLS embeddings, per-clinic LogisticRegression heads. 51 fine labels mapped to 25 broad categories
  • Result: Did not outperform flat DistilBERT
  • Takeaway: The flat classification space works well; hierarchical routing adds complexity without benefit
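
The two-stage routing can be sketched with random clusters standing in for DistilBERT CLS embeddings (everything below is synthetic; the real system used 25 broad categories, not 2):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for CLS embeddings: 2 broad categories, each holding
# 2 fine labels, as well-separated random clusters in 8 dims.
def cluster(center, n):  # hypothetical helper for toy data
    return rng.normal(center, 0.1, size=(n, 8))

X = np.vstack([cluster(0.0, 20), cluster(1.0, 20),
               cluster(5.0, 20), cluster(6.0, 20)])
broad = np.array([0] * 40 + [1] * 40)
fine = np.array([0] * 20 + [1] * 20 + [2] * 20 + [3] * 20)

# Stage 1: route to a broad category.
router = LogisticRegression(max_iter=1000).fit(X, broad)

# Stage 2: one fine-grained LogisticRegression head per broad category.
heads = {b: LogisticRegression(max_iter=1000).fit(X[broad == b], fine[broad == b])
         for b in (0, 1)}

def predict(x):
    b = router.predict(x)[0]          # pick the broad category...
    return heads[b].predict(x)[0]     # ...then its dedicated fine head

print(predict(X[:1]))
```

A routing mistake in stage 1 is unrecoverable in stage 2, which is one reason this architecture failed to beat the flat classifier.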

7. LLM Relabeling (GPT-5-mini)

  • Approach: Used OpenAI Batch API to get GPT-5-mini to reclassify all 13,672 samples. Trained DistilBERT on LLM-assigned labels
  • Result: 86.22% vs original labels | 93.24% vs LLM labels (Top-1)
  • Takeaway: LLM agrees with original labels ~85.7% of the time. LLM labels are different but not better — the original clinical labels carry domain knowledge the LLM lacks
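
An OpenAI Batch API job is a JSONL file with one request object per sample. A sketch of building one line (the model name follows the description above; the prompt wording, label subset, and truncation length are illustrative assumptions):

```python
import json

LABELS = ["Cardiology", "Dermatology", "Renal"]  # truncated for the sketch

def batch_line(sample_id: str, letter_text: str) -> str:
    """One JSONL line for the OpenAI Batch API (/v1/chat/completions)."""
    return json.dumps({
        "custom_id": sample_id,          # echoed back so results can be joined
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-5-mini",
            "messages": [
                {"role": "system",
                 "content": "Classify this NHS clinic letter into exactly "
                            "one of: " + ", ".join(LABELS)},
                {"role": "user", "content": letter_text[:4000]},
            ],
        },
    })

line = batch_line("sample-0001", "Dear Dr Smith, echocardiogram results ...")
print(line[:60])
```

One such line per sample is written to a file, uploaded, and submitted as a batch; the `custom_id` field is what lets each returned label be matched back to its training row.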

8. Consensus Relabeling

  • Approach: Only change labels where both BERT and GPT-5-mini agree the original label is wrong
  • Result: Only 4 out of 9,569 samples met the consensus criteria
  • Takeaway: BERT memorizes its training labels, so it almost never disagrees with originals on training data. Consensus is too strict
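
The consensus rule reduces to a three-way comparison per sample, which makes its strictness easy to see:

```python
def consensus_relabel(original, bert_pred, llm_pred):
    """Keep the original label unless BERT and the LLM agree with each
    other AND both disagree with the original."""
    if bert_pred == llm_pred and bert_pred != original:
        return bert_pred
    return original

# Both models disagree with the original and agree with each other -> relabel.
assert consensus_relabel("Renal", "Urology", "Urology") == "Urology"
# Models disagree with each other -> keep the original.
assert consensus_relabel("Renal", "Urology", "Cardiology") == "Renal"
# BERT sides with the original (the usual memorization case) -> keep.
assert consensus_relabel("Renal", "Renal", "Urology") == "Renal"
```

Because BERT reproduces its own training labels almost perfectly, the third branch dominates on training data, which is why only 4 of 9,569 samples ever triggered a relabel.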

9. Soft Knowledge Distillation

  • Approach: Got GPT-5-mini top-5 predictions with confidence scores as soft labels. Trained with blended loss: alpha * CE(hard) + (1-alpha) * KL(soft || student), alpha=0.5
  • Result: Top-1: 95.32% (-0.44pp) | Top-3: 97.48% (-0.58pp)
  • Takeaway: LLM self-reported confidence scores are too noisy/uniform. Soft KL loss stayed flat at ~3.5. Would need actual logprobs for this to work
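
The blended loss can be sketched in PyTorch. Note the direction of F.kl_div: it expects the student's log-probabilities as input and the teacher distribution as target, which matches the KL(soft || student) form above (batch shapes and alpha follow the description; the random tensors are placeholders):

```python
import torch
import torch.nn.functional as F

def blended_loss(student_logits, hard_labels, teacher_probs, alpha=0.5):
    """alpha * CE(hard) + (1 - alpha) * KL(teacher || student)."""
    ce = F.cross_entropy(student_logits, hard_labels)
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  teacher_probs, reduction="batchmean")
    return alpha * ce + (1 - alpha) * kl

# Toy batch: 4 samples, 49 classes; teacher probs from random logits
# stand in for the LLM's top-5 confidence distribution.
logits = torch.randn(4, 49)
labels = torch.randint(0, 49, (4,))
teacher = torch.softmax(torch.randn(4, 49), dim=-1)
loss = blended_loss(logits, labels, teacher)
print(loss.item())
```

If the teacher distribution is nearly uniform, as the self-reported LLM confidences were, the KL term carries almost no gradient signal, consistent with the flat ~3.5 soft loss observed.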

10. Cleanlab: Remove Mislabeled Samples

  • Approach: Confident learning (Northcutt et al. 2021). 3-fold cross-validation for out-of-sample probabilities, then find_label_issues() to detect mislabeled samples. Removed 142 flagged training samples and retrained
  • Result: Top-1: 95.90% (+0.14pp) | Top-3: 97.70% (-0.36pp)
  • Takeaway: Small top-1 gain, but removing ambiguous samples hurt ranked predictions. Manual inspection confirmed ~99% of flagged samples were genuinely mislabeled
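
In the real pipeline this step used cleanlab's find_label_issues; a self-contained, simplified binary approximation of the same confident-learning idea (out-of-sample probabilities from cross-validation, per-class self-confidence thresholds) on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic 2-class data with 5 deliberately flipped labels.
X = np.vstack([rng.normal(0, 0.5, (50, 4)), rng.normal(3, 0.5, (50, 4))])
y_true = np.array([0] * 50 + [1] * 50)
flipped = [3, 17, 60, 72, 99]
y_noisy = y_true.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

# Out-of-sample probabilities via 3-fold CV (the same idea cleanlab uses).
probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y_noisy,
                          cv=3, method="predict_proba")

# Per-class self-confidence thresholds: mean confidence in class j over
# samples already labeled j. Flag a sample when its probability of the
# OTHER class beats both that threshold and the given-label probability.
thresholds = np.array([probs[y_noisy == j, j].mean() for j in (0, 1)])
given = probs[np.arange(len(y_noisy)), y_noisy]
other = probs[np.arange(len(y_noisy)), 1 - y_noisy]
issues = np.where((other >= thresholds[1 - y_noisy]) & (other > given))[0]
print(sorted(issues.tolist()))
```

On clean, well-separated data like this the flagged set recovers the flipped labels; on real letters the flagged set also sweeps up genuinely ambiguous samples, which is why removing them hurt top-3/top-5.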

11. Cleanlab: Relabel Instead of Remove

  • Approach: Same cleanlab detection, but replaced wrong labels with model's predicted label instead of removing samples
  • Result (vs original test labels): Top-1: 95.80% | Top-3: 97.92% | Top-5: 98.46%
  • Result (vs corrected test labels): Top-1: 98.06% | Top-3: 99.09% | Top-5: 99.38%
  • Takeaway: The ~2pp gap between original and corrected evaluation reveals that the remaining "errors" are mostly test set noise, not model mistakes. True model performance is ~98% top-1
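
Relabeling instead of removing is a one-line change over the flagged indices. A numpy sketch, assuming pred_probs come from the same cross-validated detection step (all values below are illustrative):

```python
import numpy as np

# Given: noisy labels, out-of-sample predicted probabilities, and the
# indices flagged as label issues (e.g. by cleanlab's find_label_issues).
labels = np.array([0, 2, 1, 1, 0])
pred_probs = np.array([
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],   # labeled 2 but model says 0 -> flagged
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],   # labeled 1 but model says 2 -> flagged
    [0.7, 0.2, 0.1],
])
flagged = np.array([1, 3])

corrected = labels.copy()
corrected[flagged] = pred_probs[flagged].argmax(axis=1)  # relabel, don't remove
print(corrected.tolist())  # [0, 0, 1, 2, 0]
```

Keeping the samples preserves the training signal for the ambiguous regions that step 10 threw away, which is why this variant recovered the top-3/top-5 scores.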

12. Production Model (This Model)

  • Approach: Fresh 3-fold cleanlab on the entire dataset (13,672 samples). Found 212 mislabeled samples (1.6%), relabeled all. Trained on full corrected dataset for 6 epochs
  • Sanity check: 99.74% accuracy on the training data (expected, since the model has seen every sample)
  • Estimated true accuracy: ~98% top-1, ~99% top-3 based on corrected-label evaluation

Key Findings

  1. Label quality > model architecture. Label merging (+5pp) and cleanlab corrections (+2pp true accuracy) had more impact than any model change
  2. DistilBERT is sufficient. Domain-specific models (ClinicalBERT, BioClinicalBERT) and longer context (Longformer) didn't help
  3. ~1.6% of labels are wrong. Discharge summary (9.1%), Paediatrics (7.2%), and Physiotherapy (6.8%) are the noisiest classes
  4. The model is better than naive metrics suggest. When evaluated against corrected labels, top-1 jumps from ~96% to ~98%

Labels (49 classes)

  • A&E
  • Ambulance Notification
  • Audiology
  • Bowel Cancer Screening
  • Breast Clinic
  • Cancer Screening
  • Cardiology
  • Colposcopy
  • Dermatology
  • Diabetes & Endocrine
  • Diet Services
  • Discharge summary
  • ENT
  • Echocardiogram
  • Elderly Care
  • Gastroenterology
  • General Surgery
  • Genetics
  • Haematology
  • INR
  • Immunology
  • Mammogram
  • Maternity
  • Maxillofacial
  • Mental Health
  • Neurology
  • Neurosurgery
  • Obstetrics & Gynaecology
  • Oncology
  • Ophthalmology
  • Orthopaedics
  • Out of Hours
  • Paediatrics
  • Pain Management
  • Pharmacy
  • Physiotherapy
  • Plastic Surgery
  • Radiology
  • Renal
  • Respiratory
  • Retinal Screening
  • Rheumatology
  • Sexual Health
  • Speech and Language Therapy
  • Stroke Services
  • Urgent Care Centre
  • Urology
  • Vascular
  • Walk in Centre

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from huggingface_hub import hf_hub_download
import torch, json

model = AutoModelForSequenceClassification.from_pretrained("mansour94/kynoby-william-bert-classifier")
tokenizer = AutoTokenizer.from_pretrained("mansour94/kynoby-william-bert-classifier")
model.eval()

# Load the id -> label mapping shipped alongside the weights
label_map_path = hf_hub_download("mansour94/kynoby-william-bert-classifier", "label_map.json")
with open(label_map_path) as f:
    label_map = json.load(f)
id2label = {int(k): v for k, v in label_map["id2label"].items()}

text = "Dear Dr Smith, I am writing to inform you about the patient's ophthalmology appointment..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)

# Top-3 predictions with confidences
top3 = torch.topk(probs, 3)
for idx, conf in zip(top3.indices[0].tolist(), top3.values[0].tolist()):
    print(f"  {id2label[idx]}: {conf:.1%}")

Model size: 67M parameters (F32, safetensors)