# Handwriting Recognition with Deep Learning

A complete end-to-end handwriting recognition system using a CNN-BiLSTM-CTC architecture.

Model • Dataset Analysis • Architecture • Performance • Quick Start

## Overview

This project implements a state-of-the-art handwriting recognition system that converts handwritten text images into digital text. The model achieves 87% character-level accuracy on the IAM Handwriting Database.
### Key Highlights

- CNN-BiLSTM-CTC Architecture - industry-standard OCR architecture
- 9.1M Parameters - efficient yet powerful model
- CER: 12.95% - high character recognition accuracy
- IAM Dataset - 10,000+ handwritten text samples
- Google Colab Compatible - train on a free GPU
- Production Ready - complete inference pipeline
## Resources

| Resource | Link | Description |
|---|---|---|
| Trained Model | IsmatS/handwriting-recognition-iam | Pre-trained weights (105MB) |
| Dataset | Teklia/IAM-line | IAM Handwriting Database |
| Training Notebook | train_colab.ipynb | Full training pipeline |
| Analysis Notebook | analysis.ipynb | Dataset exploration |
## Dataset Insights

The IAM Handwriting Database is one of the most widely used datasets for handwriting recognition research. Here's what we discovered:

### Dataset Statistics
| Split | Samples | Usage |
|---|---|---|
| Train | 6,482 | Model training |
| Validation | 976 | Hyperparameter tuning |
| Test | 2,915 | Final evaluation |
| Total | 10,373 | Complete dataset |
### Sample Images

Real handwritten text samples from the dataset:

Observations:

- Diverse writing styles (cursive, print, mixed)
- Variable text lengths (10-100+ characters)
- Different pen types and ink intensity
- Natural variations in slant and spacing
### Text Length Distribution

Key insights:

- Mean length: ~48-60 characters per line
- Peak: 40-70 character range (most common)
- Range: 5-150 characters
- Implication: the model must handle variable-length sequences efficiently

Why this matters: the CTC (Connectionist Temporal Classification) loss function in our model is specifically designed to handle this variability without requiring character-level alignment annotations.
### Image Dimensions Analysis

Dimensional characteristics:

| Metric | Width | Height | Aspect Ratio |
|---|---|---|---|
| Mean | ~400-500px | ~50-100px | ~6-8:1 |
| Min | ~100px | ~30px | ~3:1 |
| Max | ~1200px | ~150px | ~15:1 |

Engineering decision:

- Fixed height: resize to 128px (preserves vertical features)
- Variable width: maintain aspect ratio (prevents distortion)
- Result: preserves legibility while standardizing input
### Character Frequency Analysis

Character distribution:

- Lowercase dominates: 'e', 't', 'a', 'o', 'n' (English letter frequency)
- Capitals less common: sentence beginnings, proper nouns
- Numbers rare: limited numeric content
- Punctuation: periods and commas most frequent

Implications:

- 74 unique characters: a-z, A-Z, 0-9, space, punctuation
- Class imbalance: the model sees more common characters
- Training strategy: no special balancing needed (mirrors real-world text)
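The frequency analysis above can be reproduced with a few lines of Python. The two transcription lines below are hypothetical stand-ins for the full label set, which the real analysis iterates over:

```python
from collections import Counter

# Two sample transcriptions stand in for the full 10,373-line label set
lines = [
    "It was a splendid interpretation of the",
    "part . The rest of the cast were well chosen ,",
]

counts = Counter("".join(lines))
print(counts.most_common(5))  # space and common lowercase letters dominate
```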
### Summary Statistics

Complete statistical overview:

- Min/max/mean for all features
- Standard deviations
- Quartile distributions
- Outlier detection
## Model Architecture

Our CNN-BiLSTM-CTC architecture combines three powerful components:

```
Input Image (128 x Variable Width)
        |
+----------------+
|   CNN Layers   |  <- Extract visual features
|   (7 blocks)   |     (edges, strokes, characters)
+----------------+
        |
Feature Maps (512 channels)
        |
+----------------+
|     BiLSTM     |  <- Model sequential dependencies
|   (2 layers)   |     (left-to-right + right-to-left)
+----------------+
        |
+----------------+
|  CTC Decoder   |  <- Alignment-free decoding
|  (75 classes)  |     (handles variable lengths)
+----------------+
        |
  Predicted Text
```
### Component Breakdown

#### 1. CNN Feature Extractor (7 Convolutional Blocks)

| Block | Layers | Output Channels | Purpose |
|---|---|---|---|
| 1 | Conv + BN + ReLU + MaxPool | 64 | Basic edge detection |
| 2 | Conv + BN + ReLU + MaxPool | 128 | Stroke patterns |
| 3 | Conv + BN + ReLU | 256 | Character components |
| 4 | Conv + BN + ReLU + MaxPool(2,1) | 256 | Horizontal compression |
| 5 | Conv + BN + ReLU | 512 | Complex features |
| 6 | Conv + BN + ReLU + MaxPool(2,1) | 512 | Further compression |
| 7 | Conv + BN + ReLU | 512 | Final features |
Key design choices:

| Design Decision | Rationale |
|---|---|
| Batch Normalization | Normalizes activations -> faster training, prevents internal covariate shift |
| Asymmetric pooling (2,1) | Compresses height but preserves width -> maintains character boundaries |
| Progressive channels (64->512) | More filters = richer features at deeper layers |
| No pooling in blocks 3, 5, 7 | Maintains spatial resolution for detail preservation |
#### Why Asymmetric MaxPool (2,1)?

```
Regular MaxPool (2,2):
  Image: [128, 400] -> [64, 200] -> [32, 100] -> [16, 50]
  Problem: loses too much horizontal resolution
  Result: character boundaries blur together

Asymmetric MaxPool (2,1):
  Image: [128, 400] -> [64, 400] -> [32, 400] -> [16, 400]
  Benefit: preserves horizontal detail
  Result: each character remains distinct
```
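A quick PyTorch sanity check (a standalone sketch, not code from the training notebook) makes the shape difference concrete:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 128, 400)  # [batch, channels, height, width]

sym = nn.MaxPool2d(kernel_size=(2, 2))   # halves height AND width
asym = nn.MaxPool2d(kernel_size=(2, 1))  # halves height only

print(sym(x).shape)   # -> [1, 1, 64, 200]
print(asym(x).shape)  # -> [1, 1, 64, 400]
```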
#### 2. Bidirectional LSTM (Sequence Modeling)

Configuration:

- Input size: 256
- Hidden size: 256
- Num layers: 2
- Bidirectional: yes (512-dim output)
- Dropout: 0.3

Why BiLSTM?

- Forward pass: reads left-to-right (like humans)
- Backward pass: reads right-to-left (context from the future)
- Combined: each character sees full sentence context
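The configuration above can be sketched directly with `nn.LSTM`; note how the bidirectional output doubles the hidden size:

```python
import torch
import torch.nn as nn

# The BiLSTM configuration listed above
lstm = nn.LSTM(input_size=256, hidden_size=256, num_layers=2,
               bidirectional=True, dropout=0.3, batch_first=True)

seq = torch.randn(1, 100, 256)  # [batch, time steps, features]
out, _ = lstm(seq)
print(out.shape)  # [1, 100, 512]: forward (256) and backward (256) concatenated
```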
#### 3. CTC Loss (Alignment-Free Training)

Advantages:

- No character-level position labels needed
- Handles variable-length input/output
- Learns temporal alignment automatically
- Industry standard for OCR and speech recognition

Total parameters: 9,139,147 (~9.1M)
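Computing the loss requires only the frame-wise class scores and the target label indices, with no positions. A minimal sketch with `nn.CTCLoss` (the label indices and blank index 0 are illustrative assumptions, not the project's actual character mapping):

```python
import torch
import torch.nn as nn

T, B, C = 50, 1, 75  # frames, batch, classes (74 characters + 1 blank)
log_probs = torch.randn(T, B, C).log_softmax(dim=2)

# Target: 5 label indices (e.g. "Hello"); index 0 assumed to be the blank
targets = torch.tensor([[8, 5, 12, 12, 15]])
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([5])

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())  # a single scalar; no per-frame alignment was supplied
```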
## Deep Dive: How the Model Works

### Step-by-Step Processing Pipeline

#### 1. Image Input Processing

```
Original image: "Hello" (handwritten)
  -> Resize: height = 128px, width proportional
  -> Normalize: pixel values [0, 255] -> [-1, 1]
  -> Tensor shape: [Batch=1, Channels=1, Height=128, Width=W]
```
#### 2. CNN Feature Extraction

The CNN progressively extracts hierarchical visual features:

| Layer Type | What It Detects | Example |
|---|---|---|
| Conv1-2 (64-128 ch) | Edges, lines, curves | Vertical strokes, horizontal bars |
| Conv3-4 (256 ch) | Stroke combinations | Letter parts: tops of 't', loops in 'e' |
| Conv5-7 (512 ch) | Character-level features | Distinguish 'o' from 'a', 'n' from 'h' |

Output: feature map of shape [Batch, 512, 7, W_reduced]

- Height reduced: 128 -> 7 (~18x compression)
- Width reduced: W -> W/4 (4x compression)
- Channels increased: 1 -> 512 (rich features)
#### 3. Sequence-to-Sequence Mapping

```
# Convert the 2D feature map into a 1D sequence
Feature map: [B, 512, 7, W/4]
  -> Reshape: [B, W/4, 512*7] = [B, W/4, 3584]
  -> Linear layer: [B, W/4, 3584] -> [B, W/4, 256]
```

Now we have a temporal sequence where each time step represents a horizontal segment of the image.
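The reshape above can be sketched in a few lines of PyTorch (using an illustrative width of W/4 = 100):

```python
import torch

B, C, H, W = 1, 512, 7, 100  # CNN output for a 100-step sequence
features = torch.randn(B, C, H, W)

# Move width to the time axis, then flatten channels x height per time step
seq = features.permute(0, 3, 1, 2).reshape(B, W, C * H)  # [1, 100, 3584]
proj = torch.nn.Linear(C * H, 256)
out = proj(seq)
print(seq.shape, out.shape)  # [1, 100, 3584] -> [1, 100, 256]
```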
#### 4. BiLSTM Sequential Modeling

```
Time step t:
  Forward LSTM  -> reads: "H" "e" "l" "l" "o"
  Backward LSTM -> reads: "o" "l" "l" "e" "H"
        |
  Concatenate: [forward_256, backward_256] = 512
        |
  Context-aware representation for each character
```

Why bidirectional matters:

- Forward: "H" knows it's at the start of a word
- Backward: "H" knows "ello" comes after it
- Combined: better prediction accuracy
#### 5. CTC Decoding

```
LSTM output: [B, W/4, 512]
  -> Linear: [B, W/4, 512] -> [B, W/4, 75]  (75 = 74 chars + blank)
  -> Softmax: probability distribution over characters
  -> CTC decode: collapse duplicates, then remove blanks
```

Example CTC alignment:

```
Model output (frame by frame):
[-, -, H, H, H, -, e, e, -, l, l, l, -, l, -, o, o, -, -]

CTC decoding:
- Collapse repeats
- Remove blanks (-)

Result: "Hello"
```

(The order matters: repeats are collapsed first, so the blank between the two 'l' runs is what preserves the double letter.)
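Greedy CTC decoding is short enough to write out in full; this sketch runs the collapse rule on the frame sequence from the example:

```python
def ctc_greedy_decode(frames, blank='-'):
    """Collapse runs of repeated symbols, then drop blanks."""
    decoded = []
    prev = None
    for ch in frames:
        if ch != prev and ch != blank:
            decoded.append(ch)
        prev = ch
    return ''.join(decoded)

frames = ['-', '-', 'H', 'H', 'H', '-', 'e', 'e', '-',
          'l', 'l', 'l', '-', 'l', '-', 'o', 'o', '-', '-']
print(ctc_greedy_decode(frames))  # Hello
```

The blank between the two 'l' runs is what keeps them from merging into a single 'l'.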
## Understanding the Metrics

### CER (Character Error Rate)

CER measures the edit distance at the character level using Levenshtein distance.

Formula:

```
CER = (Insertions + Deletions + Substitutions) / Total_Characters_in_Ground_Truth
```

Example calculations:

| Ground Truth | Prediction | Operations | CER |
|---|---|---|---|
| hello (5 chars) | helo | 1 deletion ('l') | 1/5 = 20% |
| hello (5 chars) | hallo | 1 substitution ('e'->'a') | 1/5 = 20% |
| hello (5 chars) | helloo | 1 insertion ('o') | 1/5 = 20% |
| hello (5 chars) | hello | 0 errors | 0/5 = 0% |
Our model's performance:

```
CER = 12.95%

For every 100 ground-truth characters:
- Errors: ~13 character mistakes
- Correct: ~87 characters

Character-level accuracy: 87.05%
```

What CER tells us:

- Lower is better (0% = perfect)
- Character-by-character accuracy
- Sensitive to small mistakes
- Good for measuring overall quality
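A minimal, self-contained CER implementation (the project uses the jiwer library for this; the dynamic-programming version below is just for illustration):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(truth, pred):
    return levenshtein(truth, pred) / len(truth)

print(cer("hello", "helo"))    # 0.2
print(cer("hello", "helloo"))  # 0.2
```

Note that the denominator is always the ground-truth length, so an insertion and a deletion cost the same.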
### WER (Word Error Rate)

WER measures the edit distance at the word level.

Formula:

```
WER = (Word_Insertions + Word_Deletions + Word_Substitutions) / Total_Words_in_Ground_Truth
```

Example calculations:

| Ground Truth | Prediction | Word Errors | WER |
|---|---|---|---|
| hello world (2 words) | helo world | 1 error ('hello'->'helo') | 1/2 = 50% |
| hello world (2 words) | hello world | 0 errors | 0/2 = 0% |
| the quick brown fox (4 words) | the quik brown fox | 1 error ('quick'->'quik') | 1/4 = 25% |
Our model's performance:

```
WER = 42.47%

For every 100 ground-truth words:
- Word errors: ~42 words have at least 1 character wrong
- Correct words: ~58 words

Word-level accuracy: 57.53%
```

Why is WER higher than CER? One character error corrupts the entire word:

```
Ground truth: "The magnificent castle stood tall"
Prediction:   "The magnifcent castle stood tall"
                        ^ missing 'i'

Character errors: 1
Word errors: 1 (the entire word "magnificent" is wrong)

CER = 1/33 = 3.0%
WER = 1/5 = 20%  <- much higher!
```
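WER is the same edit distance applied to word sequences instead of character sequences; running both on the example above reproduces the gap:

```python
def edit_distance(a, b):
    """Levenshtein distance over any sequence: characters or words."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

truth = "The magnificent castle stood tall"
pred = "The magnifcent castle stood tall"

cer = edit_distance(truth, pred) / len(truth)                          # char level
wer = edit_distance(truth.split(), pred.split()) / len(truth.split())  # word level
print(f"CER = {cer:.1%}, WER = {wer:.1%}")  # CER = 3.0%, WER = 20.0%
```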
What WER tells us:

- More strict than CER
- A real-world usability measure
- High WER with low CER = mostly correct characters, but incomplete words
- Can be harsh on OCR systems
### CTC Loss

The loss function used during training.

What is CTC loss? Connectionist Temporal Classification (CTC) solves the alignment problem in sequence-to-sequence tasks.

The problem CTC solves: traditional approaches need exact character positions.

```
Image: "Hello"
Required labels:
- 'H' at pixels 0-20
- 'e' at pixels 21-35
- 'l' at pixels 36-50
- 'l' at pixels 51-65
- 'o' at pixels 66-80
```

This is impractical to annotate for handwriting!

CTC solution: just provide the text, "Hello". CTC figures out the alignment automatically:

```
Input frames: |---|---|---|---|---|---|---|---|---|
Model output: | - | H | H | e | - | l | l | o | - |

CTC decoding: collapse repeats, then remove blanks (-)
Result: "Hello"
```
How CTC training works:

- Blank token (ε): a special symbol for "no character"
- Multiple alignments: there are many ways to align the same text
- Sum of probabilities: CTC sums over all valid alignments

Example:

```
Target: "Hi" (4 input frames)
Valid alignments include:
- [H, i, -, -]
- [-, H, i, -]
- [H, H, i, i]
- [-, H, -, i]
... and many more!

CTC Loss = -log(sum of probabilities of all valid paths)
```
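For a target this small the valid alignments can simply be enumerated by brute force, which makes the "sum over all paths" concrete:

```python
from itertools import product

def collapse(path, blank='-'):
    """Merge repeated symbols, then drop blanks (the CTC collapse rule)."""
    out, prev = [], None
    for ch in path:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return ''.join(out)

# Brute-force every 4-frame path over {blank, 'H', 'i'}
valid = [p for p in product('-Hi', repeat=4) if collapse(p) == 'Hi']
print(len(valid))  # 15 paths collapse to "Hi" and contribute probability to it
```

During training, CTC computes the sum over such paths efficiently with dynamic programming rather than enumeration.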
Why CTC is powerful:

- No alignment needed: just text labels
- Handles variable lengths: e.g. 100 input frames -> 5 output characters
- Robust: learns the best alignment automatically
- Standard: used in speech recognition, OCR, and handwriting recognition
CTC during inference:

```python
# Model outputs a score for each character at each frame
output = model(image)  # shape: [time_steps, batch, num_chars]

# Greedy decoding (simple approach): pick the most likely char per frame
best_path = torch.argmax(output, dim=2)
# Example: [-, -, H, H, e, e, -, l, l, l, o, -]

# CTC collapse
result = collapse_repeats_and_remove_blanks(best_path)
# Result: "Hello"
```

Advanced: beam search decoding. Instead of greedily picking the top-1 path, beam search keeps the top-K possibilities:

- More accurate but slower
- Can incorporate language models
- Used in production systems
## Model Performance Analysis

### Accuracy by Character Type

Based on validation results, approximate accuracy:

| Character Type | Accuracy | Notes |
|---|---|---|
| Lowercase (a-z) | ~90% | Most common, well-learned |
| Uppercase (A-Z) | ~85% | Less training data |
| Digits (0-9) | ~80% | Rare in dataset |
| Space | ~95% | Easy to detect |
| Punctuation (.,'") | ~75% | Often confused or missed |
### Common Confusions

Based on error analysis:

| Ground Truth | Often Predicted As | Reason |
|---|---|---|
| e | c, o | Similar circular shapes |
| n | u, r | Stroke similarity |
| a | o, e | Loop closure ambiguity |
| i | l, t | Vertical strokes |
| rn | m | Combined strokes look like 'm' |
| cl | d | Close proximity -> strokes merge |
Mitigation strategies:

- Data augmentation focusing on confusable pairs
- Language-model post-processing (spell check)
- Attention mechanisms to focus on character boundaries
## Training Results

### Training Configuration

| Hyperparameter | Value | Why This Value? |
|---|---|---|
| Epochs | 10 | Sweet spot for convergence; more epochs show diminishing returns |
| Batch Size | 8 | Balanced: large enough for stable gradients, small enough for GPU memory |
| Learning Rate | 0.001 | Standard Adam LR; reduced automatically by the scheduler on plateau |
| Optimizer | Adam | Adaptive learning rates per parameter; industry standard |
| Scheduler | ReduceLROnPlateau | Halves the LR if validation loss doesn't improve for 3 epochs |
| Gradient Clip | 5.0 | Prevents the exploding gradients common in RNNs/LSTMs |
| Image Height | 128px | Balance between detail preservation and computational efficiency |
| Dropout | 0.3 | Regularization to prevent overfitting in the LSTM layers |
### Hyperparameter Rationale

Why batch size = 8?

```
Larger batch (16+):
  + Faster training
  - Requires more GPU memory
  - Less gradient noise (can hurt generalization)

Smaller batch (4 or fewer):
  + Fits in memory easily
  + More gradient noise (better generalization)
  - Slower training
  - Unstable gradients

Batch = 8: sweet spot
```
Why gradient clipping = 5.0? LSTMs are prone to exploding gradients:

```
Without clipping:
  Gradient norm = 10,000 -> model diverges

With clipping (max norm = 5.0):
  Gradient norm = 10,000 -> scaled down to 5.0
  Training remains stable
```
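In PyTorch this is one call after `backward()`; a small standalone sketch (a toy LSTM, not the project model):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=10)
out, _ = lstm(torch.randn(5, 1, 10))
out.sum().backward()

# Rescale all gradients so their combined L2 norm is at most 5.0
torch.nn.utils.clip_grad_norm_(lstm.parameters(), max_norm=5.0)

total = torch.sqrt(sum(p.grad.norm() ** 2 for p in lstm.parameters()))
print(bool(total <= 5.0 + 1e-4))  # True: the global norm never exceeds the cap
```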
Why the ReduceLROnPlateau scheduler? It automatically adjusts the learning rate when training stalls:

```
Epoch 1-5: LR = 0.001  (loss decreasing rapidly)
Epoch 6-8: LR = 0.001  (loss plateau detected)
Epoch 9+:  LR = 0.0005 (scheduler halves the LR)
           -> enables fine-tuning
```
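The behavior can be sketched with a dummy parameter and a simulated validation loss; the loss values here are illustrative, not the project's actual curve:

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.Adam([param], lr=0.001)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
    opt, mode='min', factor=0.5, patience=3)

# Feed in a validation loss that stops improving after the third epoch
for val_loss in [1.0, 0.8, 0.6, 0.6, 0.6, 0.6, 0.6]:
    sched.step(val_loss)

print(opt.param_groups[0]['lr'])  # halved once the plateau outlasts patience
```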
### Training Progress

Convergence analysis:

| Epoch | Train Loss | Val Loss | CER ↓ | WER ↓ | Status |
|---|---|---|---|---|---|
| 1 | 3.2065 | 2.6728 | 100.0% | 100.0% | Random init |
| 2 | 1.6866 | 1.0331 | 29.3% | 71.8% | Rapid learning |
| 5 | 0.6004 | 0.5655 | 17.7% | 53.1% | Good progress |
| 7 | 0.4868 | 0.4595 | 14.4% | 46.5% | Stable |
| 10 | 0.3923 | 0.3836 | 12.95% | 42.5% | Best |
### Final Metrics

| Metric | Value | Interpretation |
|---|---|---|
| Character Error Rate (CER) | 12.95% | ~87% of characters correct |
| Word Error Rate (WER) | 42.47% | ~57.5% of words correct |
| Training Time | ~20 minutes | On a T4 GPU (10 epochs) |

Why is WER higher than CER?

- A single character error makes the entire word wrong
- Example: "splendid" -> "splondid" (1 char error = 1 word error)
- This is normal for OCR systems
## Prediction Examples

### Sample Predictions (Validation Set)

| Ground Truth | Model Prediction | Analysis |
|---|---|---|
| It was a splendid interpretation of the | It was a splendid inteyetation of thatf | ~85% correct, minor char confusions |
| sympathetic C O . Paul Daneman gave another | sympathetie CD. Sul abameman gave anotherf | Struggles with names, punctuation |
| part . The rest of the cast were well chosen , | pat . The nit of the cast were well chosen .f . | Most words correct, extra punctuation |

Common error patterns:

- Character confusions: e->c, r->n, a->o
- Proper nouns: lower accuracy on names
- Punctuation: extra/missing spaces around symbols
- End-of-line artifacts: stray trailing characters such as 'f'
## Quick Start

### 1. Load the Pre-trained Model

```python
from huggingface_hub import hf_hub_download
import torch

# Download model
model_path = hf_hub_download(
    repo_id="IsmatS/handwriting-recognition-iam",
    filename="best_model.pth"
)

# Load checkpoint
checkpoint = torch.load(model_path, map_location='cpu', weights_only=False)
print(f"Model trained for {checkpoint['epoch']} epochs")
print(f"Validation CER: {checkpoint['val_cer']:.4f}")
```
### 2. Inference on Your Own Images

```python
from PIL import Image
import numpy as np
import torch

# Load your handwritten text image
img = Image.open('your_handwriting.png').convert('L')

# Preprocess (resize to height=128, maintain aspect ratio)
w, h = img.size
new_w = int(128 * (w / h))
img = img.resize((new_w, 128), Image.LANCZOS)

# Normalize to [-1, 1]
img_array = np.array(img, dtype=np.float32) / 255.0
img_array = (img_array - 0.5) / 0.5

# Convert to tensor: [batch, channels, height, width]
img_tensor = torch.FloatTensor(img_array).unsqueeze(0).unsqueeze(0)

# Predict (after loading the model)
model.eval()
with torch.no_grad():
    output = model(img_tensor)
prediction = decode_predictions(output, char_mapper)[0]
print(f"Predicted text: {prediction}")
```
### 3. Train Your Own Model

1. Upload train_colab.ipynb to Google Colab
2. Set Runtime -> Change runtime type -> GPU (T4)
3. Run all cells
4. Training 10 epochs took ~20 minutes on a T4 in our runs; budget up to 1-2 hours depending on the GPU
## Installation

```shell
# Clone repository
git clone https://huggingface.co/IsmatS/handwriting-recognition-iam
cd handwriting-recognition-iam

# Install dependencies
pip install -r requirements.txt

# Download dataset (automatic in notebooks)
# from datasets import load_dataset
# dataset = load_dataset("Teklia/IAM-line")
```

### Requirements

```
torch>=2.0.0
datasets>=2.14.0
pillow>=9.5.0
numpy>=1.24.0
matplotlib>=3.7.0
jiwer>=3.0.0
huggingface_hub>=0.16.0
```
## Project Structure

```
handwriting-recognition-iam/
├── train_colab.ipynb        # Complete training pipeline
├── analysis.ipynb           # Dataset exploration & EDA
├── best_model.pth           # Trained model checkpoint (105MB)
├── training_history.png     # Training curves visualization
├── requirements.txt         # Python dependencies
├── README.md                # This file
└── charts/                  # Dataset analysis visualizations
    ├── 01_sample_images.png
    ├── 02_text_length_distribution.png
    ├── 03_image_dimensions.png
    ├── 04_character_frequency.png
    └── 05_summary_statistics.png
```
## Use Cases

This model can be used for:

- Document digitization - convert handwritten notes to text
- Mail processing - read handwritten addresses
- Medical records - digitize doctors' notes
- Educational tools - auto-grade handwritten assignments
- Historical archives - transcribe historical documents
- Mobile apps - real-time handwriting recognition
## Advanced Usage

### Fine-tuning on Custom Data

```python
# Load pre-trained model
checkpoint = torch.load('best_model.pth')
model.load_state_dict(checkpoint['model_state_dict'])

# Freeze CNN layers (optional)
for param in model.cnn.parameters():
    param.requires_grad = False

# Train on your dataset
# ... (your training loop)
```

### Batch Inference

```python
# Process multiple images
predictions = []
for image_path in image_paths:
    img = preprocess_image(image_path)
    pred = model.predict(img)
    predictions.append(pred)
```
## Performance Benchmarks

| Device | Batch Size | Inference Speed | Memory Usage |
|---|---|---|---|
| CPU (Intel i7) | 1 | ~200-500ms/image | ~500MB |
| GPU (T4) | 8 | ~50-100ms/image | ~2GB |
| GPU (V100) | 16 | ~20-40ms/image | ~4GB |
## Technical Details

### Why CTC Loss?

Traditional OCR requires character-level bounding boxes. CTC eliminates this:

```
Traditional: need positions: [H: 0-10px, e: 10-18px, l: 18-24px, ...]
CTC:         just need text: "Hello"
```

CTC learns the alignment automatically during training.

### Data Augmentation (Potential Improvements)

Currently not implemented, but could boost accuracy:

- Rotation (±5°)
- Elastic distortion
- Brightness/contrast variation
- Random crops
- Wave distortion

Expected gain: +2-5% accuracy
## Limitations

Current known limitations:

- Single-line only - doesn't handle multi-line paragraphs
- English only - trained on English text (74 ASCII characters)
- Cursive struggles - lower accuracy on highly cursive writing
- Proper nouns - names and uncommon words have higher error rates
- Punctuation - sometimes adds or removes punctuation
## Future Improvements

Potential enhancements:

- Attention mechanism - replace or augment the LSTM with a Transformer
- Data augmentation - improve robustness
- Larger model - scale to 20-50M parameters
- Multi-line support - detect and process paragraphs
- Language models - post-process with spelling correction
- Multilingual - extend to other languages
## References

- IAM Database: Marti & Bunke, 2002
- CTC Loss: Graves et al., 2006
- CRNN: Shi et al., 2015
- Dataset on HF: Teklia/IAM-line

## License

- Code: MIT License
- Model weights: MIT License
- IAM dataset: free for research use (see the dataset license)
## Acknowledgments

- University of Bern for the IAM Database
- Hugging Face for hosting the dataset and model
- PyTorch team for the framework
- Teklia for preparing the HF dataset version

## Contact

For questions, issues, or collaboration:

- Hugging Face: @IsmatS
- Issues: GitHub Issues