---
tags:
- text-diffusion
- machine-translation
- en-de
- masked-diffusion
- from-scratch
language:
- en
- de
datasets:
- wmt/wmt14
license: apache-2.0
---

# Text Diffusion Model for EN→DE Translation

A **masked discrete diffusion** model for English-to-German machine translation, trained from scratch on WMT14 EN-DE.

## Architecture

| Component | Detail |
|---|---|
| **Type** | Masked Discrete Diffusion |
| **Backbone** | DiT (Diffusion Transformer) with adaLN |
| **Parameters** | ~72M |
| **Blocks** | 12 DiT blocks |
| **Hidden dim** | 512, 8 attention heads |
| **Attention** | Bidirectional (no causal mask) with RoPE |
| **Conditioning** | Timestep via sinusoidal embeddings + adaLN; segment embeddings for src/tgt |
| **Weight tying** | Input embeddings tied to output projection |
| **Tokenizer** | [Helsinki-NLP/opus-mt-en-de](https://huggingface.co/Helsinki-NLP/opus-mt-en-de) (~58K vocab) |
| **Max sequence** | 128 src + 128 tgt tokens |

### Inspired by

- **[MDLM](https://arxiv.org/abs/2406.07524)** — DiT backbone architecture, masked diffusion objective
- **[LLaDA](https://arxiv.org/abs/2502.09992)** — Conditional generation via SFT (keep prompt unmasked, mask only target), 1/t ELBO weighting
- **[DiNoiSer](https://arxiv.org/abs/2302.10025)** — Noise manipulation for conditional seq2seq diffusion

## How It Works

### Training (Forward Diffusion)

1. Source (EN) and target (DE) tokens are concatenated: `[source | target]`
2. A random masking rate `t ~ Uniform(0, 1)` is sampled per example
3. Each target token is independently masked with probability `t`
4. The bidirectional DiT predicts all masked tokens simultaneously
5. Loss = cross-entropy on masked positions only, weighted by `1/t` (continuous-time ELBO); see the training sketch below

### Inference (Reverse Diffusion)

1. Start with source tokens + fully masked target: `[source | MASK MASK ... MASK]`
2. Over 50 denoising steps, iteratively predict and unmask tokens
3. At each step `t → s`: predict all masked tokens, then randomly re-mask a fraction `s/t` of them
4. Final step: all remaining masks are filled with predictions
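Both procedures are short enough to sketch in code. The snippet below is a minimal, illustrative sketch of the training objective, not the actual loop in `train.py`: it assumes a hypothetical `MASK_ID` constant and a `model(input_ids, t)` call that returns per-token logits, and it omits segment embeddings and padding handling. It shows the per-example masking rate, target-only masking, and `1/t`-weighted masked cross-entropy described above.

```python
import torch
import torch.nn.functional as F

MASK_ID = 58100  # hypothetical mask-token id; train.py defines its own

def diffusion_loss(model, src_ids, tgt_ids, eps=1e-3):
    """One masked-diffusion training step: mask target tokens at a random
    rate t and score only the masked positions, weighted by 1/t."""
    B, L_tgt = tgt_ids.shape
    # 1. Sample a masking rate t ~ Uniform(eps, 1) per example
    t = torch.rand(B, device=tgt_ids.device) * (1 - eps) + eps            # (B,)
    # 2. Independently mask each *target* token with probability t
    mask = torch.rand(B, L_tgt, device=tgt_ids.device) < t[:, None]       # (B, L_tgt)
    noised_tgt = torch.where(mask, torch.full_like(tgt_ids, MASK_ID), tgt_ids)
    # 3. Condition on the clean source; only the target is ever noised
    input_ids = torch.cat([src_ids, noised_tgt], dim=1)                   # [source | target]
    logits = model(input_ids, t)                                          # (B, L_src + L_tgt, vocab)
    tgt_logits = logits[:, src_ids.shape[1]:, :]
    # 4. Cross-entropy on masked positions only, weighted by 1/t (continuous-time ELBO);
    #    normalizing by the masked-token count is one reasonable choice
    ce = F.cross_entropy(tgt_logits.transpose(1, 2), tgt_ids, reduction="none")  # (B, L_tgt)
    per_example = (ce * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return (per_example / t).mean()
```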
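The reverse-diffusion sampler can be sketched in the same spirit. Again this assumes the hypothetical `MASK_ID` and `model(input_ids, t)` signature and uses greedy (argmax) prediction purely for illustration; the real entry point is the `generate` function imported in the Quick Start below.

```python
import torch

MASK_ID = 58100  # hypothetical mask-token id; the real id comes from the tokenizer/config

@torch.no_grad()
def sample(model, src_ids, tgt_len=128, num_steps=50):
    """Illustrative reverse diffusion: start from an all-MASK target and
    progressively unmask it over num_steps denoising steps."""
    B, device = src_ids.shape[0], src_ids.device
    tgt = torch.full((B, tgt_len), MASK_ID, dtype=src_ids.dtype, device=device)
    timesteps = torch.linspace(1.0, 0.0, num_steps + 1)  # t goes 1 -> 0
    for i in range(num_steps):
        t, s = timesteps[i].item(), timesteps[i + 1].item()
        logits = model(torch.cat([src_ids, tgt], dim=1),
                       torch.full((B,), t, device=device))
        # greedy prediction for every target position (sampling also works)
        pred = logits[:, src_ids.shape[1]:, :].argmax(dim=-1)
        still_masked = tgt == MASK_ID
        if s > 0:
            # re-mask each currently masked slot with probability s/t, so an
            # expected (t - s)/t fraction of the masks is revealed this step
            remask = torch.rand(B, tgt_len, device=device) < (s / t)
            pred = torch.where(remask, torch.full_like(pred, MASK_ID), pred)
        # tokens unmasked at earlier steps are kept; only masked slots update
        tgt = torch.where(still_masked, pred, tgt)
    return tgt  # at s = 0 every remaining mask has been filled
```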
## Training Details

| Setting | Value |
|---|---|
| **Dataset** | WMT14 EN-DE (~4.5M parallel sentence pairs) |
| **Optimizer** | AdamW (lr=3e-4, β₁=0.9, β₂=0.98, wd=0.01) |
| **Schedule** | Cosine with 4K linear warmup |
| **Effective batch size** | 256 (64 × 4 gradient accumulation) |
| **Max steps** | 200,000 |
| **Mixed precision** | FP16 |
| **Gradient clipping** | max_norm=1.0 |
| **Evaluation** | SacreBLEU on WMT14 test set every 20K steps |

## Quick Start

### Install dependencies

```bash
pip install torch transformers datasets trackio sacrebleu sacremoses sentencepiece protobuf
```

### Train

```bash
git clone https://huggingface.co/vedkdev/text-diffusion-en-de
cd text-diffusion-en-de
python train.py
```

The script will:

- Download WMT14 EN-DE automatically
- Train for 200K steps with logging via [Trackio](https://huggingface.co/docs/trackio)
- Evaluate SacreBLEU periodically
- Push checkpoints to this repo

### Adjusting for your hardware

Edit the `TRAIN_CONFIG` dict in `train.py`:

| GPU VRAM | Recommended `batch_size` | `gradient_accumulation_steps` |
|---|---|---|
| 24GB (A10G/3090/4090) | 64 | 4 |
| 16GB (T4/V100) | 32 | 8 |
| 12GB (3060) | 16 | 16 |
| 8GB (3070) | 8 | 32 |

### Inference (after training)

```python
import torch, json
from train import DiffusionTranslator, DiffusionTranslatorConfig, generate
from transformers import AutoTokenizer

# Load checkpoint
config = DiffusionTranslatorConfig(**json.load(open("checkpoints/best/config.json")))
model = DiffusionTranslator(config)
model.load_state_dict(torch.load("checkpoints/best/model.pt", map_location="cpu"))
model.eval()
tokenizer = AutoTokenizer.from_pretrained("checkpoints/best/")

# Translate
text = "The weather is nice today."
src = tokenizer(
    f"translate English to German: {text}",
    max_length=128,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)
gen_ids = generate(
    model,
    src["input_ids"],
    torch.zeros_like(src["input_ids"]),
    config,
    num_steps=50,
    device="cpu",
)
print(tokenizer.decode(gen_ids[0], skip_special_tokens=True))
```

## Expected Results

Based on published literature for similar architectures on WMT14 EN→DE:

| Model | BLEU | Reference |
|---|---|---|
| Autoregressive Transformer | ~27 | Vaswani et al. |
| DiNoiSer (continuous diffusion) | 24.6 | Ye et al. 2023 |
| SeqDiffuSeq | 19.8 | Yuan et al. 2022 |
| E2D2 (discrete diffusion) | 24.8 | Kuleshov et al. 2024 |
| **This model (target)** | **15-20** | ~72M params, no KD |

> Note: Text diffusion models typically score 2-5 BLEU below autoregressive transformers of similar size. Knowledge distillation (KD) from an AR teacher can close the gap by ~1-2 BLEU.

## Citation

If you use this model, please cite the foundational papers:

```bibtex
@article{sahoo2024mdlm,
  title={Simple and Effective Masked Diffusion Language Models},
  author={Sahoo, Subham Sekhar and Arriola, Marianne and Schiff, Yair and Gokaslan, Aaron and Marroquin, Edgar and Kuleshov, Volodymyr},
  journal={NeurIPS},
  year={2024}
}

@article{nie2025llada,
  title={Large Language Diffusion Models},
  author={Nie, Shen and Zhu, Fengqi and You, Chao and Zhang, Xiaojun and Ou, Zhenguo and Zhu, Jun},
  journal={arXiv preprint arXiv:2502.09992},
  year={2025}
}

@article{ye2023dinoiser,
  title={DiNoiSer: Diffused Conditional Sequence Learning by Manipulating Noises},
  author={Ye, Jiasheng and Zheng, Zaixiang and Bao, Yu and Qian, Lihua and Gu, Quanquan},
  journal={ACL},
  year={2023}
}
```