# DiffusionGemma vs Gemma-4 on post-OCR correction: experiment scripts

Scripts behind the [Post-OCR Gazette demo Space](https://huggingface.co/spaces/davanstrien/diffusiongemma-ocr-correction):
a first-pass benchmark of [google/diffusiongemma-26B-A4B-it](https://huggingface.co/google/diffusiongemma-26B-A4B-it)
(experimental block-diffusion LLM, 26B MoE / 4B active) against autoregressive Gemma-4 baselines
([gemma-4-E4B-it](https://huggingface.co/google/gemma-4-E4B-it), and
[gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it) as the parameter-matched MoE arm)
on post-OCR correction of 19th-century English newspaper text.

All experiments ran on [Hugging Face Jobs](https://huggingface.co/docs/hub/jobs)
(one A100-80GB, bf16, batch 1); no local GPU was involved.

## Experiment log

Every experiment is logged under [`experiments/`](experiments/): one directory per
run with a README (design, config, findings) and publishable artifacts (summaries,
text-free per-passage metrics). Start with the
[experiment index](experiments/README.md).
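
The one-directory-per-run layout can be scaffolded with a few lines of Python. This is only a sketch: `experiment_dir_name` and `scaffold_experiment` are hypothetical helpers that are not part of this bucket, the `YYYY-MM-DD_slug` naming follows the convention described later in this README, and the README section headings are placeholders.

```python
import re
import tempfile
from datetime import date
from pathlib import Path


def experiment_dir_name(slug: str, day: date) -> str:
    """Build a `YYYY-MM-DD_slug` directory name from a free-form slug."""
    clean = re.sub(r"[^a-z0-9]+", "-", slug.lower()).strip("-")
    return f"{day.isoformat()}_{clean}"


def scaffold_experiment(root: Path, slug: str, day: date) -> Path:
    """Create experiments/<YYYY-MM-DD_slug>/README.md with empty sections."""
    exp = root / "experiments" / experiment_dir_name(slug, day)
    exp.mkdir(parents=True, exist_ok=False)  # refuse to overwrite an existing run
    sections = ["Question", "Design", "Config", "Findings"]  # placeholder headings
    body = f"# {slug}\n\n" + "".join(f"## {s}\n\nTODO\n\n" for s in sections)
    (exp / "README.md").write_text(body, encoding="utf-8")
    return exp


if __name__ == "__main__":
    # Demo in a throwaway directory; the date and slug are illustrative only.
    with tempfile.TemporaryDirectory() as tmp:
        print(scaffold_experiment(Path(tmp), "canvas-sweep", date(2026, 6, 12)))
```

Writing the design section before any results exist is then a matter of filling in the generated README first and committing it ahead of the run.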

## Files

- `benchmark.py`: generation only. Self-contained UV script (PEP 723 inline deps):
  downloads BLN600, samples and align-trims passages to DiffusionGemma's 256-token
  canvas, runs each model sequentially with timing, writes `raw_outputs.jsonl`.
  Includes the OCR-seeded-canvas condition (`--canvas-init`) via the undocumented
  `decoder_input_ids` hook in `DiffusionGemmaForBlockDiffusion.generate()`.
- `metrics.py`: all metrics, computed offline from the JSONL (CER/WER via jiwer,
  over-correction rate and fix rate via character alignment). Run its `test`
  subcommand with `uv run metrics.py test`.
- `vibe_image.py`: small image-input vibe check (text vs image+text vs image-only
  conditions); see notes in the Space.

## Reproduce

```bash
# smoke (3 passages, verbose)
hf jobs uv run --flavor a100-large --timeout 45m -e HF_XET_HIGH_PERFORMANCE=1 -s HF_TOKEN \
  benchmark.py -- --mode smoke
# full benchmark (all three model arms)
hf jobs uv run --flavor a100-large --timeout 4h -e HF_XET_HIGH_PERFORMANCE=1 -s HF_TOKEN \
  benchmark.py -- --mode full --n 75 --models all --out-repo <your-private-dataset-repo>
# metrics (local, CPU)
uv run metrics.py summarize --infile raw_outputs.jsonl --outdir results/
```

## Results (v2, 2026-06-11, n=75 BLN600 passages, all arms in one job)

| Condition | CER (input 0.066) | Rel. CER red. | Over-corr. | Fix rate | Median s | tok/s |
|---|---|---|---|---|---|---|
| DiffusionGemma (default) | 0.035 | 49.5% | 1.5% | 86.0% | **1.69** | **119.9** |
| Gemma-4-E4B (greedy) | 0.042 | 45.9% | **0.4%** | 61.5% | 15.33 | 12.9 |
| Gemma-4-26B-A4B MoE (greedy) | **0.027** | **62.4%** | 0.9% | **87.5%** | 16.31 | 12.0 |

The parameter-matched MoE wins on quality; DiffusionGemma is ~10× faster at equal
capacity (and reproduces its v1 numbers to within noise). The v1 OCR-seeded-canvas
condition (CER 0.081, copy-through collapse) and its attempted rescue are in the
[experiment log](experiments/README.md).
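
The "~10×" figure can be checked directly against the table. The sketch below copies the latency and throughput columns and computes both ratios; it assumes "Median s" means median seconds per passage, and `speedup_vs` is a hypothetical helper used only for this check.

```python
# (median seconds, tokens per second) copied from the v2 results table.
RESULTS = {
    "DiffusionGemma (default)": (1.69, 119.9),
    "Gemma-4-E4B (greedy)": (15.33, 12.9),
    "Gemma-4-26B-A4B MoE (greedy)": (16.31, 12.0),
}


def speedup_vs(baseline: str, candidate: str = "DiffusionGemma (default)"):
    """Return (latency ratio, throughput ratio) of candidate over baseline."""
    base_s, base_tps = RESULTS[baseline]
    cand_s, cand_tps = RESULTS[candidate]
    return base_s / cand_s, cand_tps / base_tps


for name in ("Gemma-4-E4B (greedy)", "Gemma-4-26B-A4B MoE (greedy)"):
    latency_ratio, throughput_ratio = speedup_vs(name)
    print(f"{name}: {latency_ratio:.2f}x by latency, {throughput_ratio:.2f}x by tok/s")
```

Against the parameter-matched MoE arm both ratios come out between 9.6 and 10, and against E4B a little above 9, which is consistent with the rounded claim.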

Limitations: n=75, single prompt, one run per arm, no significance testing;
256-token block caps passage length; no greedy mode exists for the diffusion
sampler; model was one day old at benchmark time.

## Data & raw outputs

The eval data is not duplicated in this bucket. Grab it from the source, or let
the scripts do it: both corpora are fetched and parsed automatically at run time,
so a Jobs run needs no data setup.

- **BLN600** (CC-BY-NC-4.0): [figshare DOI 10.15131/shef.data.25439023](https://doi.org/10.15131/shef.data.25439023).
  One ~68 MB zip (password `BLN600`) with aligned `OCR Text/`, `Ground Truth/`
  and article-cropped `Images/` folders.
- **ICDAR2019 post-OCR** (CC-BY-4.0): [zenodo record 3515403](https://zenodo.org/records/3515403).
  Selected with `benchmark.py --dataset icdar`; source of the Space's demo passages.

Raw generation outputs (which embed BLN600 text) live in a private, git-versioned
dataset repo rather than being mirrored here. The bucket carries the text-free
per-passage metrics, which support most reanalysis (bootstrap CIs, significance
tests, per-passage plots). Generation is also fully seeded, so the scripts
regenerate raw outputs bit-for-bit from the source data.
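
As an illustration of the kind of reanalysis the per-passage metrics allow, here is a minimal sketch of a paired percentile bootstrap for the mean CER difference between two arms. It assumes you have already loaded two equal-length lists of per-passage CER values in the same passage order; the loading step is omitted because the metric file layout is not described here. `paired_bootstrap_ci` is a hypothetical helper and the numbers are made up.

```python
import random
from statistics import mean


def paired_bootstrap_ci(metric_a, metric_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(metric_a - metric_b) over paired passages."""
    if len(metric_a) != len(metric_b) or not metric_a:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    boot_means = sorted(
        mean(rng.choices(diffs, k=len(diffs)))  # resample passages with replacement
        for _ in range(n_resamples)
    )
    lo_idx = max(0, int(alpha / 2 * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return mean(diffs), boot_means[lo_idx], boot_means[hi_idx]


# Made-up per-passage CERs for two arms, for illustration only.
arm_a = [0.031, 0.052, 0.018, 0.040, 0.027, 0.035]
arm_b = [0.025, 0.047, 0.020, 0.031, 0.022, 0.030]
point, low, high = paired_bootstrap_ci(arm_a, arm_b)
print(f"mean difference {point:.4f}, 95% CI [{low:.4f}, {high:.4f}]")
```

Resampling passages rather than the two arms independently keeps the pairing intact, which matters because passage difficulty varies far more than the gap between models.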

## Picking up this work (humans or agents)

Everything needed to reproduce or extend these experiments is in this bucket; no
other repo is required. The workflow, end to end:

1. **Run generation on HF Jobs, directly from this bucket, no download needed.**
   Every script is self-contained (PEP 723 inline deps) and `hf jobs uv run`
   accepts a URL; bucket files resolve at
   `https://huggingface.co/buckets/davanstrien/diffusiongemma-ocr-bench/resolve/<file>`.
   (Needs an HF account with Jobs billing; `hf auth login`.)
   Always smoke first. Measured costs on `a100-large` (~$2.50/h):

   ```bash
   BUCKET=https://huggingface.co/buckets/davanstrien/diffusiongemma-ocr-bench/resolve
   # smoke: ~8 min wall-clock ≈ $0.35 (model download runs at multi-GB/s, ~2 min for 52 GB)
   hf jobs uv run --flavor a100-large --timeout 45m -e HF_XET_HIGH_PERFORMANCE=1 -s HF_TOKEN \
     $BUCKET/benchmark.py -- --mode smoke
   # full three-arm benchmark: ~60-80 min ≈ $3
   hf jobs uv run --flavor a100-large --timeout 3h -e HF_XET_HIGH_PERFORMANCE=1 -s HF_TOKEN \
     $BUCKET/benchmark.py -- --mode full --n 75 --models all --out-repo <your-private-dataset-repo>
   # canvas sweep screen stage (27-cell factorial + anchor, n=20): ~30-40 min ≈ $1.50
   hf jobs uv run --flavor a100-large --timeout 90m -e HF_XET_HIGH_PERFORMANCE=1 -s HF_TOKEN \
     $BUCKET/canvas_sweep.py -- --stage screen --out-repo <your-private-dataset-repo>
   ```

2. **To modify a script first**, copy it locally, edit, and run the local file the
   same way:

   ```bash
   hf buckets cp hf://buckets/davanstrien/diffusiongemma-ocr-bench/canvas_sweep.py .
   ```

   Raw outputs contain BLN600 text (CC-BY-NC) → `--out-repo` must be **private**.
   Monitor with `hf jobs logs <job-id>`; list with `hf jobs ps`.

3. **Compute metrics** (CPU, seconds; locally, or as a URL-run cpu-basic job):

   ```bash
   hf download <your-private-dataset-repo> raw_outputs.jsonl --repo-type dataset --local-dir .
   uv run metrics.py summarize --infile raw_outputs.jsonl --outdir results/
   uv run metrics.py sweep --infile raw_outputs_canvas_sweep_screen.jsonl --outdir results/
   ```

   Metrics are deliberately decoupled from generation: metric changes never require
   re-running GPU jobs, and the metric files contain no copyrighted text, so they
   are safe to publish.

4. **Log your experiment.** Convention: one directory per experiment under
   [`experiments/`](experiments/), named `YYYY-MM-DD_slug/`, containing a README
   (question; design, ideally written *before* results; findings, including
   negative ones) plus the text-free metric artifacts. Add a row to the
   [experiment index](experiments/README.md). Upload with:

   ```bash
   hf buckets cp -r my-experiment-dir hf://buckets/<your-bucket>/experiments/YYYY-MM-DD_slug
   ```
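
The cost figures quoted in step 1 are wall-clock time multiplied by the hourly rate. A minimal sketch of that arithmetic follows, assuming the ~$2.50/h `a100-large` rate and the minute ranges given above. `job_cost` is a hypothetical helper, not part of the scripts, and actual billing granularity may differ.

```python
HOURLY_RATE_USD = 2.50  # approximate a100-large rate quoted in this README


def job_cost(minutes: float, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Estimated cost in USD for a job that runs for `minutes` of wall-clock time."""
    return minutes / 60 * hourly_rate


# (low, high) wall-clock minutes for each run, as listed in step 1.
RUNS = {
    "smoke": (8, 8),
    "full three-arm benchmark": (60, 80),
    "canvas sweep screen stage": (30, 40),
}

for name, (low_min, high_min) in RUNS.items():
    print(f"{name}: ${job_cost(low_min):.2f} to ${job_cost(high_min):.2f}")
```

The same helper gives an upper bound for a run that hits its `--timeout`: a 3 h timeout at this rate caps the full benchmark at about $7.50.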

Useful context for new experiments: DiffusionGemma's sampler knobs live in its
`generation_config.json` (`t_max`/`t_min` temperature schedule, EntropyBoundSampler
`entropy_bound`, `confidence_threshold`, `max_denoising_steps` 48, fixed 256-token
canvas). The undocumented `decoder_input_ids` kwarg of
`DiffusionGemmaForBlockDiffusion.generate()` replaces the random initial canvas;
see `canvas_sweep.py` for a worked example, and the v1 experiment README for the
engineering gotchas (output channel markers, `.sequences` includes the prompt,
streamer subclassing for step counts).
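
Before sweeping the sampler, it can help to print the knobs a given checkpoint actually ships with. The sketch below reads a locally downloaded `generation_config.json` and picks out the keys named above. It assumes those keys sit at the top level of the JSON, which has not been verified here; `pick_sampler_knobs` and `load_sampler_knobs` are hypothetical helpers, and the demo values other than `max_denoising_steps` are made up.

```python
import json
from pathlib import Path

# Knob names as listed in this README; their position in the file is an assumption.
KNOB_NAMES = (
    "t_max",
    "t_min",
    "entropy_bound",
    "confidence_threshold",
    "max_denoising_steps",
)


def pick_sampler_knobs(config: dict) -> dict:
    """Return only the sampler knobs; a missing key maps to None."""
    return {name: config.get(name) for name in KNOB_NAMES}


def load_sampler_knobs(path: Path) -> dict:
    """Read a local generation_config.json and extract the sampler knobs."""
    return pick_sampler_knobs(json.loads(Path(path).read_text(encoding="utf-8")))


# Illustrative input only: 48 comes from this README, the other values are invented.
demo_config = {"max_denoising_steps": 48, "t_max": 1.0, "unrelated_key": True}
print(pick_sampler_knobs(demo_config))
```

Logging this dictionary next to each run's results makes it easier to tell later which settings were defaults and which were overridden.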

By [davanstrien](https://huggingface.co/davanstrien).