๐Ÿ–‹๏ธ Handwriting Recognition with Deep Learning


A complete end-to-end handwriting recognition system using a CNN-BiLSTM-CTC architecture.

🎯 Model • 📊 Dataset Analysis • 🏗️ Architecture • 📈 Performance • 🚀 Quick Start


🎯 Overview

This project implements an end-to-end handwriting recognition system that converts images of handwritten text into digital text. The model achieves 87% character-level accuracy on the IAM Handwriting Database.

Key Highlights

  • ✅ CNN-BiLSTM-CTC Architecture - Industry-standard OCR architecture
  • ✅ 9.1M Parameters - Efficient yet powerful model
  • ✅ CER: 12.95% - High character recognition accuracy
  • ✅ IAM Dataset - 10,000+ handwritten text samples
  • ✅ Google Colab Compatible - Train on free GPU
  • ✅ Production Ready - Complete inference pipeline

🔗 Resources

| Resource | Link | Description |
|---|---|---|
| 🤗 Trained Model | IsmatS/handwriting-recognition-iam | Pre-trained weights (105MB) |
| 📦 Dataset | Teklia/IAM-line | IAM Handwriting Database |
| 📓 Training Notebook | train_colab.ipynb | Full training pipeline |
| 📊 Analysis Notebook | analysis.ipynb | Dataset exploration |

📊 Dataset Insights

The IAM Handwriting Database is one of the most widely used datasets for handwriting recognition research. Here's what we discovered:

Dataset Statistics

| Split | Samples | Usage |
|---|---|---|
| Train | 6,482 | Model training |
| Validation | 976 | Hyperparameter tuning |
| Test | 2,915 | Final evaluation |
| **Total** | **10,373** | Complete dataset |

📸 Sample Images

Real handwritten text samples from the dataset:

Sample Images

Observations:

  • ✍️ Diverse writing styles (cursive, print, mixed)
  • 📏 Variable text lengths (10-100+ characters)
  • 🎨 Different pen types and ink intensities
  • 📐 Natural variations in slant and spacing

๐Ÿ“ Text Length Distribution

Text Length Distribution

Key Insights:

  • ๐Ÿ“Š Mean length: ~48-60 characters per line
  • ๐Ÿ“ˆ Peak: 40-70 character range (most common)
  • ๐Ÿ”ข Range: 5-150 characters
  • ๐ŸŽฏ Implication: Model must handle variable-length sequences efficiently

Why this matters: The CTC (Connectionist Temporal Classification) loss function in our model is specifically designed to handle this variability without requiring character-level alignment annotations.


๐Ÿ“ Image Dimensions Analysis

Image Dimensions

Dimensional Characteristics:

Metric Width Height Aspect Ratio
Mean ~400-500px ~50-100px ~6-8:1
Min ~100px ~30px ~3:1
Max ~1200px ~150px ~15:1

Engineering Decision:

  • ๐Ÿ”„ Fixed height: Resize to 128px (preserves vertical features)
  • ๐Ÿ“ Variable width: Maintain aspect ratio (prevents distortion)
  • ๐ŸŽฏ Result: Preserves legibility while standardizing input

🔤 Character Frequency Analysis

Character Frequency

Character Distribution:

  • 🔡 Lowercase dominates: 'e', 't', 'a', 'o', 'n' (English letter frequency)
  • 🔠 Capitals less common: sentence beginnings, proper nouns
  • 🔢 Numbers rare: limited numeric content
  • ⚙️ Punctuation: periods and commas most frequent

Implications:

  • 📚 74 unique characters: a-z, A-Z, 0-9, space, punctuation
  • ⚖️ Class imbalance: the model sees more common characters
  • 🎓 Training strategy: no special balancing needed (mirrors real-world text)
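A frequency count like the one charted above takes only a few lines of Python; the two transcripts below are illustrative stand-ins for the dataset's line labels:

```python
from collections import Counter

# Two example line transcripts standing in for the IAM labels
transcripts = [
    "A MOVE to stop Mr. Gaitskell from",
    "nominating any more Labour life Peers",
]

# Count every character across all lines
freq = Counter("".join(transcripts))

# For line-level text, space is typically among the most common symbols,
# followed by high-frequency English letters
print(freq.most_common(5))
```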

📋 Summary Statistics

Summary Statistics

Complete Statistical Overview:

  • 📊 Min/Max/Mean for all features
  • 📈 Standard deviations
  • 🎯 Quartile distributions
  • 🔍 Outlier detection

๐Ÿ—๏ธ Model Architecture

Our CNN-BiLSTM-CTC architecture combines three powerful components:

Input Image (128 x Variable Width)
           โ†“
    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
    โ”‚  CNN Layers  โ”‚  โ† Extract visual features
    โ”‚   (7 blocks) โ”‚     (edges, strokes, characters)
    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
           โ†“
    Feature Maps (512 channels)
           โ†“
    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
    โ”‚   BiLSTM     โ”‚  โ† Model sequential dependencies
    โ”‚  (2 layers)  โ”‚     (left-to-right + right-to-left)
    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
           โ†“
    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
    โ”‚ CTC Decoder  โ”‚  โ† Alignment-free decoding
    โ”‚  (75 chars)  โ”‚     (handles variable lengths)
    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
           โ†“
    Predicted Text

Component Breakdown

1️⃣ CNN Feature Extractor (7 Convolutional Blocks)

| Block | Layers | Output Channels | Purpose |
|---|---|---|---|
| 1 | Conv + BN + ReLU + MaxPool | 64 | Basic edge detection |
| 2 | Conv + BN + ReLU + MaxPool | 128 | Stroke patterns |
| 3 | Conv + BN + ReLU | 256 | Character components |
| 4 | Conv + BN + ReLU + MaxPool(2,1) | 256 | Horizontal compression |
| 5 | Conv + BN + ReLU | 512 | Complex features |
| 6 | Conv + BN + ReLU + MaxPool(2,1) | 512 | Further compression |
| 7 | Conv + BN + ReLU | 512 | Final features |

Key Design Choices:

| Design Decision | Rationale |
|---|---|
| Batch Normalization | Normalizes activations → faster training, prevents internal covariate shift |
| Asymmetric pooling (2,1) | Compresses height but preserves width → maintains character boundaries |
| Progressive channels (64→512) | More filters = richer features at deeper layers |
| No pooling in Conv 3, 5 | Maintains spatial resolution for detail preservation |

Why Asymmetric MaxPool (2,1)?

```text
Regular MaxPool (2,2):
  Image: [128, 400] → [64, 200] → [32, 100] → [16, 50]
  Problem: Loses too much horizontal resolution ❌
  Result: Character boundaries blur together

Asymmetric MaxPool (2,1):
  Image: [128, 400] → [64, 400] → [32, 400] → [16, 400]
  Benefit: Preserves horizontal details ✅
  Result: Each character remains distinct
```
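The shape arithmetic above is easy to verify in PyTorch; the tensor sizes below are illustrative:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 128, 400)  # [batch, channels, height=128, width=400]

square = nn.MaxPool2d(kernel_size=(2, 2))  # halves height AND width
asym = nn.MaxPool2d(kernel_size=(2, 1))    # halves height, keeps width intact

print(square(x).shape)  # torch.Size([1, 64, 64, 200])
print(asym(x).shape)    # torch.Size([1, 64, 64, 400])
```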

2๏ธโƒฃ Bidirectional LSTM (Sequence Modeling)

Configuration:
- Input Size: 256
- Hidden Size: 256
- Num Layers: 2
- Bidirectional: Yes (512 output)
- Dropout: 0.3

Why BiLSTM?

  • โฌ…๏ธ Forward pass: Reads left-to-right (like humans)
  • โžก๏ธ Backward pass: Reads right-to-left (context from future)
  • ๐Ÿ”„ Combined: Each character sees full sentence context
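This configuration maps directly onto PyTorch's `nn.LSTM`; a minimal sketch, where the sequence length of 100 is an arbitrary example:

```python
import torch
import torch.nn as nn

# Matches the configuration listed above
lstm = nn.LSTM(
    input_size=256, hidden_size=256, num_layers=2,
    bidirectional=True, dropout=0.3, batch_first=True,
)

seq = torch.randn(1, 100, 256)  # [batch, time_steps, features]
out, _ = lstm(seq)

# Forward (256) and backward (256) states are concatenated per time step
print(out.shape)  # torch.Size([1, 100, 512])
```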

3๏ธโƒฃ CTC Loss (Alignment-Free Training)

Advantages:

  • ๐ŸŽฏ No character-level position labels needed
  • ๐Ÿ“ Handles variable-length input/output
  • ๐Ÿ”„ Learns temporal alignment automatically
  • โœ… Industry standard for OCR/speech recognition

Total Parameters: 9,139,147 (~9.1M)


๐Ÿ” Deep Dive: How the Model Works

Step-by-Step Processing Pipeline

1. Image Input Processing

Original Image: "Hello" (handwritten)
      โ†“
Resize: Height=128px, Width proportional
      โ†“
Normalize: Pixel values from [0,255] โ†’ [-1,1]
      โ†“
Tensor Shape: [Batch=1, Channels=1, Height=128, Width=W]

2. CNN Feature Extraction

The CNN progressively extracts hierarchical visual features:

| Layer Type | What It Detects | Example |
|---|---|---|
| Conv1-2 (64-128 ch) | Edges, lines, curves | Vertical strokes, horizontal bars |
| Conv3-4 (256 ch) | Stroke combinations | Letter parts: tops of 't', loops in 'e' |
| Conv5-7 (512 ch) | Character-level features | Distinguish 'o' from 'a', 'n' from 'h' |

Output: Feature map of shape [Batch, 512, 7, W_reduced]

  • Height reduced: 128 → 7 (~18x compression)
  • Width reduced: W → W/4 (4x compression)
  • Channels increased: 1 → 512 (rich features)

3. Sequence-to-Sequence Mapping

# Convert 2D feature map to 1D sequence
Feature Map: [B, 512, 7, W/4]
      โ†“
Reshape: [B, W/4, 512*7] = [B, W/4, 3584]
      โ†“
Linear Layer: [B, W/4, 3584] โ†’ [B, W/4, 256]

Now we have a temporal sequence where each time step represents a horizontal segment of the image.
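A minimal sketch of this reshape in PyTorch; the batch size of 2 and width of 100 are arbitrary, and the `nn.Linear` here stands in for the model's projection layer:

```python
import torch
import torch.nn as nn

feat = torch.randn(2, 512, 7, 100)  # [B, C, H, W/4] as produced by the CNN

b, c, h, w = feat.shape
# Move width to the "time" axis, then flatten channels x height per time step
seq = feat.permute(0, 3, 1, 2).reshape(b, w, c * h)  # [B, W/4, 512*7] = [2, 100, 3584]
proj = nn.Linear(c * h, 256)(seq)                    # [B, W/4, 256]

print(seq.shape, proj.shape)
```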

4. BiLSTM Sequential Modeling

```text
Time step t:
  Forward LSTM →  Reads: "H" "e" "l" "l" "o"
  Backward LSTM ← Reads: "o" "l" "l" "e" "H"
                    ↓
  Concatenate: [forward_256, backward_256] = 512
                    ↓
  Context-aware representation for each character
```

Why bidirectional matters:

  • Forward: "H" knows it's at the start of a word
  • Backward: "H" knows "ello" comes after it
  • Combined: Better prediction accuracy

5. CTC Decoding

```text
LSTM Output: [B, W/4, 512]
      ↓
Linear: [B, W/4, 512] → [B, W/4, 75]  (75 = 74 chars + blank)
      ↓
Softmax: Probability distribution over characters
      ↓
CTC Decode: Remove blanks and duplicates
```

Example CTC Alignment:

```text
Model output (frame by frame):
[-, -, H, H, H, -, e, e, -, l, l, l, -, l, -, o, o, -, -]

CTC decoding:
- Remove blanks (-)
- Collapse repeats
Result: "Hello" ✅
```
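Greedy CTC collapse is only a few lines; this sketch reproduces the frame sequence above. Order matters: repeats are collapsed before blanks are removed, so the doubled 'l' survives thanks to the blank between its two runs:

```python
def ctc_collapse(frames, blank="-"):
    """Greedy CTC decode: collapse consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for ch in frames:
        if ch != prev and ch != blank:  # new symbol that isn't a blank
            out.append(ch)
        prev = ch
    return "".join(out)

frames = list("--HHH-ee-lll-l-oo--")
print(ctc_collapse(frames))  # Hello
```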

๐Ÿ“ Understanding the Metrics

CER (Character Error Rate)

CER measures the edit distance at character level using Levenshtein distance.

Formula:

CER = (Insertions + Deletions + Substitutions) / Total_Characters_in_Ground_Truth

Example Calculation:

Ground Truth Prediction Operations CER
hello (5 chars) helo 1 deletion ('l') 1/5 = 20%
hello (5 chars) hallo 1 substitution ('e'โ†’'a') 1/5 = 20%
hello (5 chars) helloo 1 insertion ('o') 1/6 = 16.7%
hello (5 chars) hello 0 errors 0/5 = 0% โœ…

Our Model Performance:

```text
CER = 12.95%

Example with 100 characters:
- Ground truth: 100 characters
- Errors: ~13 character mistakes
- Correct: ~87 characters ✅

Character-level accuracy: 87.05%
```

What CER tells us:

  • ✅ Lower is better (0% = perfect)
  • ✅ Character-by-character accuracy
  • ✅ Sensitive to small mistakes
  • ✅ Good for measuring overall quality

WER (Word Error Rate)

WER measures the edit distance at the word level.

Formula:

```text
WER = (Word_Insertions + Word_Deletions + Word_Substitutions) / Total_Words_in_Ground_Truth
```

Example Calculation:

| Ground Truth | Prediction | Word Errors | WER |
|---|---|---|---|
| hello world (2 words) | helo world | 1 error ('hello'→'helo') | 1/2 = 50% |
| hello world (2 words) | hello world | 0 errors | 0/2 = 0% ✅ |
| the quick brown fox (4 words) | the quik brown fox | 1 error ('quick'→'quik') | 1/4 = 25% |

Our Model Performance:

```text
WER = 42.47%

Example with 100 words:
- Ground truth: 100 words
- Word errors: ~42 words have at least 1 character wrong
- Correct words: ~58 words ✅

Word-level accuracy: 57.53%
```

Why WER > CER?

One character error corrupts the entire word:

```text
Ground Truth: "The magnificent castle stood tall"
Prediction:   "The magnifcent castle stood tall"
                        ↑ missing 'i'

Character errors: 1
Word errors: 1 (the entire word "magnificent" is wrong)

CER = 1/33 = 3.0%
WER = 1/5 = 20%  ← Much higher!
```

What WER tells us:

  • ✅ More strict than CER
  • ✅ Real-world usability measure
  • ✅ High WER with low CER = mostly correct characters, but words incomplete
  • ⚠️ Can be harsh on OCR systems

CTC Loss

The loss function used during training.

What is CTC Loss?

Connectionist Temporal Classification (CTC) solves the alignment problem in sequence-to-sequence tasks.

The Problem CTC Solves:

Traditional approaches need exact character positions:

```text
Image: "Hello"
Required labels:
- 'H' at pixels 0-20
- 'e' at pixels 21-35
- 'l' at pixels 36-50
- 'l' at pixels 51-65
- 'o' at pixels 66-80
```

This is impossible to annotate for handwriting!

CTC Solution:

Just provide the text: "Hello" ✅

CTC figures out the alignment automatically:

```text
Input Frames:  |---|---|---|---|---|---|---|---|---|
Model Output:  | - | H | H | e | - | l | l | o | - |
                 ↓   ↓   ↓   ↓   ↓   ↓   ↓   ↓   ↓
CTC Decoding:  Remove blanks (-) and collapse repeats
Result:        "Hello" ✅
```

How CTC Training Works:

  1. Blank token (ε): Special symbol for "no character"
  2. Multiple alignments: Many ways to align the same text
  3. Sum probabilities: CTC sums over all valid alignments

Example:

```text
Target: "Hi"

Valid alignments:
- [H, i, -, -]
- [-, H, i, -]
- [H, H, i, i]
- [-, H, -, i]
... many more!

CTC Loss = -log(sum of probabilities of all valid paths)
```
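PyTorch implements this directly as `nn.CTCLoss`; a minimal sketch with random outputs — the class indices standing in for "Hi" are illustrative, with index 0 reserved for the blank:

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)  # index 0 is the blank token

T, B, C = 50, 1, 75  # time steps, batch, 74 characters + blank
log_probs = torch.randn(T, B, C).log_softmax(2)  # per-frame log-probabilities

targets = torch.tensor([[8, 9]])     # "Hi" as illustrative class indices
input_lengths = torch.tensor([T])    # frames per sample
target_lengths = torch.tensor([2])   # target characters per sample

# -log of the summed probability of all valid alignments
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```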

Why CTC is Powerful:

โœ… No alignment needed: Just text labels โœ… Handles variable lengths: Input 100 frames โ†’ Output 5 characters โœ… Robust: Learns best alignment automatically โœ… Standard: Used in speech recognition, OCR, handwriting

CTC During Inference:

```python
# Model outputs per-frame probabilities
output = model(image)  # Shape: [time_steps, batch, num_chars]

# Greedy decoding (simple approach): pick the most likely char per frame
best_path = torch.argmax(output, dim=2)
# Example: [-, -, H, H, e, e, -, l, l, l, o, -]

# CTC collapse: merge repeats, then remove blanks
result = collapse_repeats_and_remove_blanks(best_path)
# Result: "Hello"
```

Advanced: Beam Search Decoding

Instead of greedy decoding (picking the top-1 path), beam search keeps the top-K possibilities:

  • More accurate but slower
  • Can incorporate language models
  • Used in production systems

🎯 Model Performance Analysis

Accuracy by Character Type

Based on validation results, approximate accuracy:

| Character Type | Accuracy | Notes |
|---|---|---|
| Lowercase (a-z) | ~90% | Most common, well-learned |
| Uppercase (A-Z) | ~85% | Less training data |
| Digits (0-9) | ~80% | Rare in dataset |
| Space | ~95% | Easy to detect |
| Punctuation (.,'") | ~75% | Often confused or missed |

Common Confusions

Based on error analysis:

| Ground Truth | Often Predicted As | Reason |
|---|---|---|
| e | c, o | Similar circular shapes |
| n | u, r | Stroke similarity |
| a | o, e | Loop closure ambiguity |
| i | l, t | Vertical strokes |
| rn | m | Combined strokes look like 'm' |
| cl | d | Close proximity → merged |

Mitigation Strategies:

  • 🔄 Data augmentation focusing on confusable pairs
  • 📚 Language model post-processing (spell check)
  • 🎯 Attention mechanisms to focus on character boundaries

📈 Training Results

Training Configuration

| Hyperparameter | Value | Why This Value? |
|---|---|---|
| Epochs | 10 | Sweet spot for convergence; more epochs show diminishing returns |
| Batch Size | 8 | Balanced: large enough for stable gradients, small enough for GPU memory |
| Learning Rate | 0.001 | Standard Adam LR; reduced automatically by the scheduler on plateau |
| Optimizer | Adam | Adaptive learning rates per parameter; industry standard |
| Scheduler | ReduceLROnPlateau | Halves the LR if validation loss doesn't improve for 3 epochs |
| Gradient Clip | 5.0 | Prevents exploding gradients, common in RNNs/LSTMs |
| Image Height | 128px | Balance between detail preservation and computational efficiency |
| Dropout | 0.3 | Regularization to prevent overfitting in LSTM layers |

Hyperparameter Rationale

Why Batch Size = 8?

```text
Larger batch (16+):
  ✅ Faster training
  ❌ Requires more GPU memory
  ❌ Less gradient noise (can hurt generalization)

Smaller batch (4 or less):
  ✅ Fits in memory easily
  ✅ More gradient noise (better generalization)
  ❌ Slower training
  ❌ Unstable gradients

Batch=8: Sweet spot ✅
```

Why Gradient Clipping = 5.0?

LSTMs are prone to exploding gradients:

```text
Without clipping:
  Gradient = 10,000 → Model diverges ❌

With clipping (max norm = 5.0):
  Gradient = 10,000 → Scaled down to 5.0 ✅
  Training remains stable
```
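In PyTorch this is a single call to `clip_grad_norm_` between `backward()` and `optimizer.step()`; the inflated loss below is just a way to manufacture a huge gradient for the demonstration:

```python
import torch
import torch.nn as nn

layer = nn.Linear(10, 10)
loss = (layer(torch.randn(4, 10)) * 1e6).sum()  # inflate to force a huge gradient
loss.backward()

# Returns the total norm BEFORE clipping; grads are rescaled in place
total_norm = torch.nn.utils.clip_grad_norm_(layer.parameters(), max_norm=5.0)

clipped = sum(p.grad.norm() ** 2 for p in layer.parameters()) ** 0.5
print(total_norm.item(), clipped.item())  # huge original norm vs. ~5.0 after clipping
```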

Why the ReduceLROnPlateau Scheduler?

Automatically adjusts the learning rate when training stalls:

```text
Epoch 1-5: LR = 0.001 (loss decreasing rapidly)
Epoch 6-8: LR = 0.001 (loss plateau detected)
Epoch 9+:  LR = 0.0005 (scheduler reduces by 50%)
           → Enables fine-tuning ✅
```
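A minimal sketch of this behavior; the validation-loss values are made up to force a plateau, and note that `step()` receives the monitored metric:

```python
import torch

opt = torch.optim.Adam([torch.nn.Parameter(torch.zeros(1))], lr=0.001)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
    opt, mode="min", factor=0.5, patience=3,
)

# Loss improves, then plateaus; after patience is exceeded the LR halves
for epoch, val_loss in enumerate([1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.7]):
    sched.step(val_loss)  # pass the metric being monitored
    print(epoch, opt.param_groups[0]["lr"])
```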

Training Progress

Training History

Convergence Analysis:

| Epoch | Train Loss | Val Loss | CER ↓ | WER ↓ | Status |
|---|---|---|---|---|---|
| 1 | 3.2065 | 2.6728 | 100.0% | 100.0% | Random init |
| 2 | 1.6866 | 1.0331 | 29.3% | 71.8% | ⚡ Rapid learning |
| 5 | 0.6004 | 0.5655 | 17.7% | 53.1% | 🎯 Good progress |
| 7 | 0.4868 | 0.4595 | 14.4% | 46.5% | 📊 Stable |
| 10 | 0.3923 | 0.3836 | 12.95% | 42.5% | ✅ Best |

Final Metrics

| Metric | Value | Interpretation |
|---|---|---|
| Character Error Rate (CER) | 12.95% | 🎯 87% of characters correct |
| Word Error Rate (WER) | 42.47% | ✅ 57.5% of words correct |
| Training Time | ~20 minutes | ⚡ On a T4 GPU (10 epochs) |

Why is WER higher than CER?

  • A single character error makes the entire word wrong
  • Example: "splendid" → "splondid" (1 char error = 1 word error)
  • This is normal for OCR systems

🔬 Prediction Examples

Sample Predictions (Validation Set)

| Ground Truth | Model Prediction | Analysis |
|---|---|---|
| It was a splendid interpretation of the | It was a splendid inteyetation of thatf | ✅ ~85% correct, minor char confusions |
| sympathetic C O . Paul Daneman gave another | sympathetie CD. Sul abameman gave anotherf | ⚠️ Struggles with names, punctuation |
| part . The rest of the cast were well chosen , | pat . The nit of the cast were well chosen .f . | ✅ Most words correct, extra punctuation |

Common Error Patterns:

  • 🔤 Character confusions: e↔c, r↔n, a↔o
  • 👤 Proper nouns: lower accuracy on names
  • ✍️ Punctuation: extra/missing spaces around symbols
  • 🔚 End-of-line artifacts: extra 'f' or '.' characters

🚀 Quick Start

1️⃣ Load Pre-trained Model

```python
from huggingface_hub import hf_hub_download
import torch

# Download model
model_path = hf_hub_download(
    repo_id="IsmatS/handwriting-recognition-iam",
    filename="best_model.pth"
)

# Load checkpoint
checkpoint = torch.load(model_path, map_location='cpu', weights_only=False)
print(f"Model trained for {checkpoint['epoch']} epochs")
print(f"Validation CER: {checkpoint['val_cer']:.4f}")
```

2๏ธโƒฃ Inference on Your Own Images

from PIL import Image
import numpy as np

# Load your handwritten text image
img = Image.open('your_handwriting.png').convert('L')

# Preprocess (resize to height=128, maintain aspect ratio)
w, h = img.size
new_w = int(128 * (w / h))
img = img.resize((new_w, 128), Image.LANCZOS)

# Normalize
img_array = np.array(img, dtype=np.float32) / 255.0
img_array = (img_array - 0.5) / 0.5

# Convert to tensor
img_tensor = torch.FloatTensor(img_array).unsqueeze(0).unsqueeze(0)

# Predict (after loading model)
model.eval()
with torch.no_grad():
    output = model(img_tensor)
    prediction = decode_predictions(output, char_mapper)[0]

print(f"Predicted text: {prediction}")

3๏ธโƒฃ Train Your Own Model

# Upload train_colab.ipynb to Google Colab
# Set Runtime โ†’ Change runtime type โ†’ GPU (T4)
# Run all cells

# Training takes ~1-2 hours for 10 epochs

📦 Installation

```bash
# Clone repository
git clone https://huggingface.co/IsmatS/handwriting-recognition-iam
cd handwriting-recognition-iam

# Install dependencies
pip install -r requirements.txt

# Download dataset (automatic in notebooks)
# from datasets import load_dataset
# dataset = load_dataset("Teklia/IAM-line")
```

Requirements

```text
torch>=2.0.0
datasets>=2.14.0
pillow>=9.5.0
numpy>=1.24.0
matplotlib>=3.7.0
jiwer>=3.0.0
huggingface_hub>=0.16.0
```

๐Ÿ“ Project Structure

handwriting-recognition-iam/
โ”œโ”€โ”€ ๐Ÿ““ train_colab.ipynb          # Complete training pipeline
โ”œโ”€โ”€ ๐Ÿ“Š analysis.ipynb              # Dataset exploration & EDA
โ”œโ”€โ”€ ๐Ÿ’พ best_model.pth              # Trained model checkpoint (105MB)
โ”œโ”€โ”€ ๐Ÿ“ˆ training_history.png        # Training curves visualization
โ”œโ”€โ”€ ๐Ÿ“‹ requirements.txt            # Python dependencies
โ”œโ”€โ”€ ๐Ÿ“– README.md                   # This file
โ””โ”€โ”€ ๐Ÿ“‚ charts/                     # Dataset analysis visualizations
    โ”œโ”€โ”€ 01_sample_images.png
    โ”œโ”€โ”€ 02_text_length_distribution.png
    โ”œโ”€โ”€ 03_image_dimensions.png
    โ”œโ”€โ”€ 04_character_frequency.png
    โ””โ”€โ”€ 05_summary_statistics.png

🎯 Use Cases

This model can be used for:

  • 📝 Document Digitization - Convert handwritten notes to text
  • 📧 Mail Processing - Read handwritten addresses
  • 🏥 Medical Records - Digitize doctors' notes
  • 🏫 Educational Tools - Auto-grade handwritten assignments
  • 🏛️ Historical Archives - Transcribe historical documents
  • 📱 Mobile Apps - Real-time handwriting recognition

🔧 Advanced Usage

Fine-tuning on Custom Data

```python
# Load pre-trained model
checkpoint = torch.load('best_model.pth')
model.load_state_dict(checkpoint['model_state_dict'])

# Freeze CNN layers (optional)
for param in model.cnn.parameters():
    param.requires_grad = False

# Train on your dataset
# ... (your training loop)
```

Batch Inference

```python
# Process multiple images
predictions = []
for image_path in image_paths:
    img = preprocess_image(image_path)
    pred = model.predict(img)
    predictions.append(pred)
```

📊 Performance Benchmarks

| Device | Batch Size | Inference Speed | Memory Usage |
|---|---|---|---|
| CPU (Intel i7) | 1 | ~200-500ms/image | ~500MB |
| GPU (T4) | 8 | ~50-100ms/image | ~2GB |
| GPU (V100) | 16 | ~20-40ms/image | ~4GB |

🎓 Technical Details

Why CTC Loss?

Traditional OCR requires character-level bounding boxes. CTC eliminates this:

```text
Traditional: Need positions: [H:0-10px, e:10-18px, l:18-24px, ...]
CTC: Just need the text: "Hello" ✅
```

CTC learns the alignment automatically during training.

Data Augmentation (Potential Improvements)

Currently not implemented, but could boost accuracy:

  • 🔄 Rotation (±5°)
  • 📐 Elastic distortion
  • 🎨 Brightness/contrast variation
  • ✂️ Random crops
  • 🌊 Wave distortion

Expected gain: +2-5% accuracy


🚧 Limitations

Current known limitations:

  • ❌ Single-line only - Doesn't handle multi-line paragraphs
  • ❌ English only - Trained on English text (74 ASCII characters)
  • ❌ Cursive struggles - Lower accuracy on highly cursive writing
  • ❌ Proper nouns - Names and uncommon words have higher error rates
  • ❌ Punctuation - Sometimes adds/removes punctuation

🔮 Future Improvements

Potential enhancements:

  1. Attention Mechanism - Replace/augment the LSTM with a Transformer
  2. Data Augmentation - Improve robustness
  3. Larger Model - Scale to 20-50M parameters
  4. Multi-line Support - Detect and process paragraphs
  5. Language Models - Post-process with spelling correction
  6. Multilingual - Extend to other languages


📄 License

  • Code: MIT License
  • Model Weights: MIT License
  • IAM Dataset: Free for research use (see dataset license)

🙏 Acknowledgments

  • 🎓 University of Bern for the IAM Database
  • 🤗 Hugging Face for hosting the dataset and model
  • 🔥 PyTorch team for the framework
  • 📊 Teklia for preparing the HF dataset version

📧 Contact

For questions, issues, or collaboration:


⭐ If you find this project useful, please consider giving it a star! ⭐


Made with ❤️ using PyTorch and Hugging Face
