---
license: mit
datasets:
- wmt/wmt19
language:
- en
- de
pipeline_tag: translation
---

# Seq2Seq German-English Translation Model

A sequence-to-sequence neural machine translation model that translates German text to English, built using PyTorch with an LSTM encoder-decoder architecture.

## Model Description

This model implements the classic seq2seq architecture from [Sutskever et al. (2014)](https://arxiv.org/abs/1409.3215) for German-English translation:

- **Encoder**: 2-layer LSTM that processes German input sequences
- **Decoder**: 2-layer LSTM that generates English output sequences
- **Training Strategy**: Teacher forcing during training, autoregressive generation during inference (sketched below)
- **Vocabulary**: 30k German words, 25k English words
- **Dataset**: Trained on 2M sentence pairs from WMT19 (a subset of the full ~35M-pair dataset)
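
To make the training strategy concrete, here is a minimal sketch of teacher forcing, assuming encoder/decoder interfaces like those sketched under Model Architecture below (illustrative names, not the repository's actual API):

```python
import torch

def forward_teacher_forcing(encoder, decoder, src, tgt):
    """Training-time forward pass: feed the *gold* target token at every step."""
    state = encoder(src)                  # context vector(s) from the German sentence
    logits = []
    for t in range(tgt.size(1) - 1):
        inp = tgt[:, t:t + 1]             # teacher forcing: gold token, not the prediction
        step_logits, state = decoder(inp, state)
        logits.append(step_logits)
    # (batch, tgt_len - 1, vocab); at inference the model instead feeds back
    # its own argmax prediction at each step (see Model Architecture Details)
    return torch.cat(logits, dim=1)
```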

## Model Architecture

```
German Input → Embedding → LSTM Encoder → Context Vector → LSTM Decoder → Linear Projection → English Output
```

**Hyperparameters:**
- Embedding size: 256
- Hidden size: 512
- LSTM layers: 2 (both encoder and decoder)
- Dropout: 0.3
- Batch size: 64
- Learning rate: 0.0003
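
With those hyperparameters and the vocabulary sizes from Model Description, the two modules could look roughly like this (a minimal sketch, not the repository's exact code):

```python
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=30_000, emb_size=256, hidden_size=512,
                 num_layers=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers=num_layers,
                            dropout=dropout, batch_first=True)

    def forward(self, src):
        # The final (hidden, cell) states act as the fixed-size context vector
        _, state = self.lstm(self.embedding(src))
        return state

class Decoder(nn.Module):
    def __init__(self, vocab_size=25_000, emb_size=256, hidden_size=512,
                 num_layers=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers=num_layers,
                            dropout=dropout, batch_first=True)
        self.proj = nn.Linear(hidden_size, vocab_size)  # hidden state -> vocabulary logits

    def forward(self, tokens, state):
        out, state = self.lstm(self.embedding(tokens), state)
        return self.proj(out), state
```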

## Training Data

- **Dataset**: WMT19 German-English Translation Task
- **Size**: 2M sentence pairs (filtered subset)
- **Preprocessing**: Sentences filtered by length (5-50 tokens)
- **Tokenization**: Custom word-level tokenizer with special tokens (`<PAD>`, `<UNK>`, `<START>`, `<END>`)
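
For illustration, a word-level tokenizer with those special tokens might look like this sketch (the pickled tokenizers' actual interface may differ):

```python
class WordTokenizer:
    """Minimal word-level tokenizer with <PAD>, <UNK>, <START>, <END>."""

    def __init__(self, vocab):
        self.itos = ["<PAD>", "<UNK>", "<START>", "<END>"] + list(vocab)
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}

    def encode(self, sentence):
        unk = self.stoi["<UNK>"]
        ids = [self.stoi.get(w, unk) for w in sentence.lower().split()]
        return [self.stoi["<START>"]] + ids + [self.stoi["<END>"]]

    def decode(self, ids):
        skip = {"<PAD>", "<START>", "<END>"}
        return " ".join(self.itos[i] for i in ids if self.itos[i] not in skip)
```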

## Performance

**Training Results (5 epochs):**
- Training loss: 4.0949 → 3.1843 (down 0.91, ~22% relative)
- Validation loss: 4.1918 → 3.8537 (down 0.34, ~8% relative)
- Training device: Apple Silicon (MPS)
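
For context, these cross-entropy losses translate into relative reductions and perplexities as follows (plain arithmetic on the numbers above):

```python
import math

train_start, train_end = 4.0949, 3.1843
val_start, val_end = 4.1918, 3.8537

print(f"train: -{train_start - train_end:.2f} ({(train_start - train_end) / train_start:.0%} relative)")
print(f"val:   -{val_start - val_end:.2f} ({(val_start - val_end) / val_start:.0%} relative)")
# Perplexity = exp(loss): validation goes from ~66 to ~47
print(f"val perplexity: {math.exp(val_start):.1f} -> {math.exp(val_end):.1f}")
```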

## Usage

### Quick Start

```python
# This is a custom PyTorch model, not a Transformers model.
# Download the files and use them with the provided inference script.

import requests
from pathlib import Path

# Download model files
base_url = "https://huggingface.co/sumitdotml/seq2seq-de-en/resolve/main"
files = ["best_model.pt", "german_tokenizer.pkl", "english_tokenizer.pkl"]

for file in files:
    response = requests.get(f"{base_url}/{file}")
    response.raise_for_status()  # fail loudly on a bad download
    Path(file).write_bytes(response.content)
    print(f"Downloaded {file}")
```
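
Once the files are downloaded, they can be loaded as below. This is a sketch: unpickling the tokenizers requires the repository's tokenizer class to be importable, and the checkpoint's internal structure is an assumption; `inference.py` is the authoritative loading path.

```python
import pickle
import torch

# The tokenizer classes must be on the import path for pickle to resolve them,
# so run this from a clone of the repository.
with open("german_tokenizer.pkl", "rb") as f:
    de_tokenizer = pickle.load(f)
with open("english_tokenizer.pkl", "rb") as f:
    en_tokenizer = pickle.load(f)

# Load on CPU first; move to "mps" or "cuda" if available
checkpoint = torch.load("best_model.pt", map_location="cpu")
print(type(checkpoint))  # inspect whether it's a raw state_dict or a wrapper dict
```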

### Translation Examples

```bash
# Interactive mode
python inference.py --interactive

# Single translation
python inference.py --sentence "Hallo, wie geht es dir?" --verbose

# Demo mode
python inference.py
```

**Example Translations:**
- `"Das ist ein gutes Buch."` → `"this is a good idea."`
- `"Wo ist der Bahnhof?"` → `"where is the <UNK>"`
- `"Ich liebe Deutschland."` → `"i share."`

## Files Included

- `best_model.pt`: PyTorch model checkpoint (trained weights + architecture)
- `german_tokenizer.pkl`: German vocabulary and tokenization logic
- `english_tokenizer.pkl`: English vocabulary and tokenization logic

## Installation & Setup

1. **Clone the repository:**
```bash
git clone https://github.com/sumitdotml/seq2seq
cd seq2seq
```

2. **Set up the environment:**
```bash
uv venv && source .venv/bin/activate  # or python -m venv .venv
uv pip install torch requests tqdm    # or pip install torch requests tqdm
```

3. **Download the model:**
```bash
python scripts/download_pretrained.py
```

4. **Start translating:**
```bash
python scripts/inference.py --interactive
```

## Model Architecture Details

The model uses a custom implementation with these components:

- **Encoder** (`src/models/encoder.py`): LSTM-based encoder with an embedding layer
- **Decoder** (`src/models/decoder.py`): LSTM-based decoder without attention
- **Seq2Seq** (`src/models/seq2seq.py`): Main model combining encoder and decoder with the generation logic
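
The generation logic boils down to greedy autoregressive decoding. A condensed sketch, reusing the illustrative encoder/decoder interfaces from above (not the repository's exact signatures):

```python
import torch

@torch.no_grad()
def greedy_decode(encoder, decoder, src_ids, start_id, end_id, max_len=50):
    state = encoder(src_ids)             # encode the full German sentence
    inp = torch.tensor([[start_id]])     # batch of one, seeded with <START>
    out_ids = []
    for _ in range(max_len):
        logits, state = decoder(inp, state)
        inp = logits[:, -1].argmax(dim=-1, keepdim=True)  # most likely next token
        if inp.item() == end_id:
            break
        out_ids.append(inp.item())
    return out_ids
```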

## Limitations

- **Vocabulary constraints**: Limited to 30k German / 25k English words
- **Training data**: Only 2M sentence pairs (vs. ~35M in full WMT19)
- **No attention mechanism**: Basic encoder-decoder without attention
- **Simple tokenization**: Word-level tokenization without subword units
- **Translation quality**: Suitable for basic phrases; struggles with complex sentences

## Training Details

**Environment:**
- Framework: PyTorch 2.0+
- Device: Apple Silicon (MPS acceleration)
- Training duration: 5 epochs
- Validation strategy: Held-out validation set

**Optimization:**
- Optimizer: Adam (lr=0.0003)
- Loss function: CrossEntropyLoss (ignoring padding)
- Gradient clipping: 1.0
- Scheduler: StepLR (step_size=3, gamma=0.5)
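
Wired together, those optimization choices correspond to roughly this loop (a sketch; `model`, `train_loader`, and `PAD_ID` stand in for the repository's objects):

```python
import torch
import torch.nn as nn

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)  # padding tokens don't count
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)

for epoch in range(5):
    for src, tgt in train_loader:
        optimizer.zero_grad()
        logits = model(src, tgt)          # teacher-forced forward pass
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         tgt[:, 1:].reshape(-1))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
    scheduler.step()                      # halve the LR every 3 epochs
```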

## Reproduce Training

```bash
# Full training pipeline
python scripts/data_preparation.py  # Download WMT19 data
python src/data/tokenization.py     # Build vocabularies
python scripts/train.py             # Train the model

# For full-dataset training, modify data_preparation.py:
# use_full_dataset = True  # Line 133-134
```

## Citation

If you use this model, please cite:

```bibtex
@misc{seq2seq-de-en,
  author = {sumitdotml},
  title  = {German-English Seq2Seq Translation Model},
  year   = {2025},
  url    = {https://huggingface.co/sumitdotml/seq2seq-de-en},
  note   = {PyTorch implementation of sequence-to-sequence translation}
}
```

## References

- Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. NeurIPS.
- WMT19 Translation Task: https://huggingface.co/datasets/wmt/wmt19

## License

MIT License - see the repository for the full license text.

## Contact

For questions about this model or the training code, please open an issue in the [GitHub repository](https://github.com/sumitdotml/seq2seq).