---
license: mit
datasets:
- wmt/wmt19
language:
- en
- de
pipeline_tag: translation
---

# Seq2Seq German-English Translation Model

A sequence-to-sequence neural machine translation model that translates German text to English, built using PyTorch with an LSTM encoder-decoder architecture.

## Model Description

This model implements the classic seq2seq architecture from [Sutskever et al. (2014)](https://arxiv.org/abs/1409.3215) for German-English translation:

- **Encoder**: 2-layer LSTM that processes German input sequences
- **Decoder**: 2-layer LSTM that generates English output sequences
- **Training Strategy**: Teacher forcing during training, autoregressive generation during inference (sketched below)
- **Vocabulary**: 30k German words, 25k English words
- **Dataset**: Trained on 2M sentence pairs from WMT19 (a subset of the full ~35M-pair dataset)
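
To make the training strategy concrete, here is a minimal sketch of teacher forcing, assuming encoder/decoder interfaces like those sketched under Model Architecture below (illustrative names, not the repository's actual API):

```python
import torch

def forward_teacher_forcing(encoder, decoder, src, tgt):
    """Training-time forward pass: feed the *gold* target token at every step."""
    state = encoder(src)                  # context vector(s) from the German sentence
    logits = []
    for t in range(tgt.size(1) - 1):
        inp = tgt[:, t:t + 1]             # teacher forcing: gold token, not the prediction
        step_logits, state = decoder(inp, state)
        logits.append(step_logits)
    # (batch, tgt_len - 1, vocab); at inference the model instead feeds back
    # its own argmax prediction at each step (see Model Architecture Details)
    return torch.cat(logits, dim=1)
```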

## Model Architecture

```
German Input → Embedding → LSTM Encoder → Context Vector → LSTM Decoder → Linear Projection → English Output
```

**Hyperparameters:**
- Embedding size: 256
- Hidden size: 512
- LSTM layers: 2 (both encoder and decoder)
- Dropout: 0.3
- Batch size: 64
- Learning rate: 0.0003
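
With those hyperparameters and the vocabulary sizes from Model Description, the two modules could look roughly like this (a minimal sketch, not the repository's exact code):

```python
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=30_000, emb_size=256, hidden_size=512,
                 num_layers=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers=num_layers,
                            dropout=dropout, batch_first=True)

    def forward(self, src):
        # The final (hidden, cell) states act as the fixed-size context vector
        _, state = self.lstm(self.embedding(src))
        return state

class Decoder(nn.Module):
    def __init__(self, vocab_size=25_000, emb_size=256, hidden_size=512,
                 num_layers=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers=num_layers,
                            dropout=dropout, batch_first=True)
        self.proj = nn.Linear(hidden_size, vocab_size)  # hidden state -> vocabulary logits

    def forward(self, tokens, state):
        out, state = self.lstm(self.embedding(tokens), state)
        return self.proj(out), state
```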

## Training Data

- **Dataset**: WMT19 German-English Translation Task
- **Size**: 2M sentence pairs (filtered subset)
- **Preprocessing**: Sentences filtered by length (5-50 tokens)
- **Tokenization**: Custom word-level tokenizer with special tokens (`<PAD>`, `<UNK>`, `<START>`, `<END>`)
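
For illustration, a word-level tokenizer with those special tokens might look like this sketch (the pickled tokenizers' actual interface may differ):

```python
class WordTokenizer:
    """Minimal word-level tokenizer with <PAD>, <UNK>, <START>, <END>."""

    def __init__(self, vocab):
        self.itos = ["<PAD>", "<UNK>", "<START>", "<END>"] + list(vocab)
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}

    def encode(self, sentence):
        unk = self.stoi["<UNK>"]
        ids = [self.stoi.get(w, unk) for w in sentence.lower().split()]
        return [self.stoi["<START>"]] + ids + [self.stoi["<END>"]]

    def decode(self, ids):
        skip = {"<PAD>", "<START>", "<END>"}
        return " ".join(self.itos[i] for i in ids if self.itos[i] not in skip)
```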

## Performance

**Training Results (5 epochs):**
- Training loss: 4.0949 → 3.1843 (down 0.91, ~22% relative)
- Validation loss: 4.1918 → 3.8537 (down 0.34, ~8% relative)
- Training device: Apple Silicon (MPS)
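
For context, these cross-entropy losses translate into relative reductions and perplexities as follows (plain arithmetic on the numbers above):

```python
import math

train_start, train_end = 4.0949, 3.1843
val_start, val_end = 4.1918, 3.8537

print(f"train: -{train_start - train_end:.2f} ({(train_start - train_end) / train_start:.0%} relative)")
print(f"val:   -{val_start - val_end:.2f} ({(val_start - val_end) / val_start:.0%} relative)")
# Perplexity = exp(loss): validation goes from ~66 to ~47
print(f"val perplexity: {math.exp(val_start):.1f} -> {math.exp(val_end):.1f}")
```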

## Usage

### Quick Start

```python
# This is a custom PyTorch model, not a Transformers model.
# Download the files and use them with the provided inference script.

import requests
from pathlib import Path

# Download model files
base_url = "https://huggingface.co/sumitdotml/seq2seq-de-en/resolve/main"
files = ["best_model.pt", "german_tokenizer.pkl", "english_tokenizer.pkl"]

for file in files:
    response = requests.get(f"{base_url}/{file}")
    response.raise_for_status()  # fail loudly on a bad download
    Path(file).write_bytes(response.content)
    print(f"Downloaded {file}")
```
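
Once the files are downloaded, they can be loaded as below. This is a sketch: unpickling the tokenizers requires the repository's tokenizer class to be importable, and the checkpoint's internal structure is an assumption; `inference.py` is the authoritative loading path.

```python
import pickle
import torch

# The tokenizer classes must be on the import path for pickle to resolve them,
# so run this from a clone of the repository.
with open("german_tokenizer.pkl", "rb") as f:
    de_tokenizer = pickle.load(f)
with open("english_tokenizer.pkl", "rb") as f:
    en_tokenizer = pickle.load(f)

# Load on CPU first; move to "mps" or "cuda" if available
checkpoint = torch.load("best_model.pt", map_location="cpu")
print(type(checkpoint))  # inspect whether it's a raw state_dict or a wrapper dict
```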

### Translation Examples

```bash
# Interactive mode
python inference.py --interactive

# Single translation
python inference.py --sentence "Hallo, wie geht es dir?" --verbose

# Demo mode
python inference.py
```

**Example Translations:**
- `"Das ist ein gutes Buch."` → `"this is a good idea."`
- `"Wo ist der Bahnhof?"` → `"where is the <UNK>"`
- `"Ich liebe Deutschland."` → `"i share."`

## Files Included

- `best_model.pt`: PyTorch model checkpoint (trained weights + architecture)
- `german_tokenizer.pkl`: German vocabulary and tokenization logic
- `english_tokenizer.pkl`: English vocabulary and tokenization logic

## Installation & Setup

1. **Clone the repository:**
```bash
git clone https://github.com/sumitdotml/seq2seq
cd seq2seq
```

2. **Set up the environment:**
```bash
uv venv && source .venv/bin/activate  # or python -m venv .venv
uv pip install torch requests tqdm    # or pip install torch requests tqdm
```

3. **Download the model:**
```bash
python scripts/download_pretrained.py
```

4. **Start translating:**
```bash
python scripts/inference.py --interactive
```

## Model Architecture Details

The model uses a custom implementation with these components:

- **Encoder** (`src/models/encoder.py`): LSTM-based encoder with an embedding layer
- **Decoder** (`src/models/decoder.py`): LSTM-based decoder without attention
- **Seq2Seq** (`src/models/seq2seq.py`): Main model combining encoder and decoder with the generation logic
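
The generation logic boils down to greedy autoregressive decoding. A condensed sketch, reusing the illustrative encoder/decoder interfaces from above (not the repository's exact signatures):

```python
import torch

@torch.no_grad()
def greedy_decode(encoder, decoder, src_ids, start_id, end_id, max_len=50):
    state = encoder(src_ids)             # encode the full German sentence
    inp = torch.tensor([[start_id]])     # batch of one, seeded with <START>
    out_ids = []
    for _ in range(max_len):
        logits, state = decoder(inp, state)
        inp = logits[:, -1].argmax(dim=-1, keepdim=True)  # most likely next token
        if inp.item() == end_id:
            break
        out_ids.append(inp.item())
    return out_ids
```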

## Limitations

- **Vocabulary constraints**: Limited to 30k German / 25k English words
- **Training data**: Only 2M sentence pairs (vs. ~35M in full WMT19)
- **No attention mechanism**: Basic encoder-decoder without attention
- **Simple tokenization**: Word-level tokenization without subword units
- **Translation quality**: Suitable for basic phrases; struggles with complex sentences

## Training Details

**Environment:**
- Framework: PyTorch 2.0+
- Device: Apple Silicon (MPS acceleration)
- Training duration: 5 epochs
- Validation strategy: Held-out validation set

**Optimization:**
- Optimizer: Adam (lr=0.0003)
- Loss function: CrossEntropyLoss (ignoring padding)
- Gradient clipping: 1.0
- Scheduler: StepLR (step_size=3, gamma=0.5)
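
Wired together, those optimization choices correspond to roughly this loop (a sketch; `model`, `train_loader`, and `PAD_ID` stand in for the repository's objects):

```python
import torch
import torch.nn as nn

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)  # padding tokens don't count
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)

for epoch in range(5):
    for src, tgt in train_loader:
        optimizer.zero_grad()
        logits = model(src, tgt)          # teacher-forced forward pass
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         tgt[:, 1:].reshape(-1))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
    scheduler.step()                      # halve the LR every 3 epochs
```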

## Reproduce Training

```bash
# Full training pipeline
python scripts/data_preparation.py  # Download WMT19 data
python src/data/tokenization.py     # Build vocabularies
python scripts/train.py             # Train the model

# For full-dataset training, modify data_preparation.py:
# use_full_dataset = True  # Line 133-134
```

## Citation

If you use this model, please cite:

```bibtex
@misc{seq2seq-de-en,
  author = {sumitdotml},
  title  = {German-English Seq2Seq Translation Model},
  year   = {2025},
  url    = {https://huggingface.co/sumitdotml/seq2seq-de-en},
  note   = {PyTorch implementation of sequence-to-sequence translation}
}
```

## References

- Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. NeurIPS.
- WMT19 Translation Task: https://huggingface.co/datasets/wmt/wmt19

## License

MIT License - see the repository for the full license text.

## Contact

For questions about this model or the training code, please open an issue in the [GitHub repository](https://github.com/sumitdotml/seq2seq).