	# DiffusionGemma vs Gemma-4 on post-OCR correction — experiment scripts

	Scripts behind the [Post-OCR Gazette demo Space](https://huggingface.co/spaces/davanstrien/diffusiongemma-ocr-correction):
	a first-pass benchmark of [google/diffusiongemma-26B-A4B-it](https://huggingface.co/google/diffusiongemma-26B-A4B-it)
	(experimental block-diffusion LLM, 26B MoE / 4B active) against autoregressive Gemma-4 baselines
	([gemma-4-E4B-it](https://huggingface.co/google/gemma-4-E4B-it), and
	[gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it) as the parameter-matched MoE arm)
	on post-OCR correction of 19th-century English newspaper text.

	All experiments ran on [Hugging Face Jobs](https://huggingface.co/docs/hub/jobs)
	(one A100-80GB, bf16, batch 1) — no local GPU involved.

	## Experiment log

	Every experiment is logged under [`experiments/`](experiments/) — one directory per
	run with a README (design, config, findings) and publishable artifacts (summaries,
	text-free per-passage metrics). Start with the
	[experiment index](experiments/README.md).

	## Files

	- `benchmark.py` — generation only. Self-contained UV script (PEP 723 inline deps):
	downloads BLN600, samples and align-trims passages to DiffusionGemma's 256-token
	canvas, runs each model sequentially with timing, writes `raw_outputs.jsonl`.
	Includes the OCR-seeded-canvas condition (`--canvas-init`) via the undocumented
	`decoder_input_ids` hook in `DiffusionGemmaForBlockDiffusion.generate()`.
	- `metrics.py` — all metrics, computed offline from the JSONL (CER/WER via jiwer,
	over-correction rate and fix rate via character alignment). Self-test: `uv run metrics.py test`.
	- `vibe_image.py` — small image-input vibe check (text vs image+text vs image-only
	conditions); see notes in the Space.
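
To make the metric names concrete, here is a minimal standard-library sketch of the kind of computation `metrics.py` performs. It is not the script's code: the real script uses jiwer for CER/WER, and the fix-rate and over-correction definitions below (counted over ground-truth character positions, aligned with `difflib`) are assumptions made for illustration. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)


def matched_positions(reference: str, hypothesis: str) -> set[int]:
    """Reference character positions that the alignment marks as reproduced."""
    matcher = SequenceMatcher(None, reference, hypothesis, autojunk=False)
    matched: set[int] = set()
    for block in matcher.get_matching_blocks():
        matched.update(range(block.a, block.a + block.size))
    return matched


def correction_rates(ocr: str, output: str, truth: str) -> tuple[float, float]:
    """Return (fix_rate, over_correction_rate) under the ASSUMED definitions:

    fix rate        = share of truth characters wrong in the OCR but right in the output
    over-correction = share of truth characters right in the OCR but wrong in the output
    Characters that exist only in the OCR or output (insertions) are not counted.
    """
    ocr_right = matched_positions(truth, ocr)
    out_right = matched_positions(truth, output)
    ocr_wrong = set(range(len(truth))) - ocr_right
    fix_rate = len(ocr_wrong & out_right) / len(ocr_wrong) if ocr_wrong else 0.0
    over_rate = len(ocr_right - out_right) / len(ocr_right) if ocr_right else 0.0
    return fix_rate, over_rate


# Toy strings, not BLN600 data.
truth = "the quick brown fox"
ocr = "tbe qnick brown fox"      # two OCR errors
output = "the quick brown fax"   # both errors fixed, one new error introduced

print(f"CER before: {cer(ocr, truth):.3f}, after: {cer(output, truth):.3f}")
print("fix rate, over-correction rate:", correction_rates(ocr, output, truth))
```

Keeping the two rates separate matters because a model can lower CER while still damaging text the OCR had right; the toy output above fixes both OCR errors yet introduces a new one.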

	## Reproduce

	```bash
	# smoke (3 passages, verbose)
	hf jobs uv run --flavor a100-large --timeout 45m -e HF_XET_HIGH_PERFORMANCE=1 -s HF_TOKEN \
	benchmark.py -- --mode smoke

	# full benchmark (all three model arms)
	hf jobs uv run --flavor a100-large --timeout 4h -e HF_XET_HIGH_PERFORMANCE=1 -s HF_TOKEN \
	benchmark.py -- --mode full --n 75 --models all --out-repo <your-private-dataset-repo>

	# metrics (local, CPU)
	uv run metrics.py summarize --infile raw_outputs.jsonl --outdir results/
	```

	## Results (v2, 2026-06-11, n=75 BLN600 passages, all arms in one job)

	| Condition | CER (input 0.066) | Rel. CER red. | Over-corr. | Fix rate | Median s | tok/s |
	|---|---|---|---|---|---|---|
	| DiffusionGemma (default) | 0.035 | 49.5% | 1.5% | 86.0% | 1.69 | 119.9 |
	| Gemma-4-E4B (greedy) | 0.042 | 45.9% | 0.4% | 61.5% | 15.33 | 12.9 |
	| Gemma-4-26B-A4B MoE (greedy) | 0.027 | 62.4% | 0.9% | 87.5% | 16.31 | 12.0 |

	The parameter-matched MoE wins on quality (CER 0.027 vs 0.035); DiffusionGemma is ~10× faster
	at the same 26B-A4B size (median 1.69 s vs 16.31 s) and reproduces its v1 numbers to within
	noise. The v1 OCR-seeded-canvas condition (CER 0.081, copy-through collapse) and its attempted
	rescue are in the [experiment log](experiments/README.md).
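
The "~10×" speed figure can be checked directly against the table. The short sketch below recomputes the latency and throughput ratios from the reported medians and tok/s values; the dictionary and variable names are illustrative only.

```python
# Values copied from the v2 results table.
median_seconds = {
    "DiffusionGemma": 1.69,
    "Gemma-4-E4B": 15.33,
    "Gemma-4-26B-A4B MoE": 16.31,
}
tokens_per_second = {
    "DiffusionGemma": 119.9,
    "Gemma-4-E4B": 12.9,
    "Gemma-4-26B-A4B MoE": 12.0,
}

# Parameter-matched comparison: DiffusionGemma vs the autoregressive 26B-A4B MoE.
latency_speedup = median_seconds["Gemma-4-26B-A4B MoE"] / median_seconds["DiffusionGemma"]
throughput_speedup = tokens_per_second["DiffusionGemma"] / tokens_per_second["Gemma-4-26B-A4B MoE"]

print(f"latency speedup:    {latency_speedup:.2f}x")
print(f"throughput speedup: {throughput_speedup:.2f}x")
```

Both ratios land just under 10, which is where the rounded "~10×" comes from.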

	Limitations: n=75, single prompt, one run per arm, no significance testing;
	256-token block caps passage length; no greedy mode exists for the diffusion
	sampler; model was one day old at benchmark time.

	## Data & raw outputs

	The eval data is not duplicated in this bucket — grab it from the source (or let
	the scripts do it: both corpora are fetched and parsed automatically at run time,
	so a Jobs run needs no data setup):

	- BLN600 (CC-BY-NC-4.0): [figshare DOI 10.15131/shef.data.25439023](https://doi.org/10.15131/shef.data.25439023)
	— one ~68 MB zip (password `BLN600`) with aligned `OCR Text/`, `Ground Truth/`
	and article-cropped `Images/` folders.
	- ICDAR2019 post-OCR (CC-BY-4.0): [zenodo record 3515403](https://zenodo.org/records/3515403)
	— `benchmark.py --dataset icdar`; source of the Space's demo passages.

	Raw generation outputs (which embed BLN600 text) live in a private, git-versioned
	dataset repo rather than being mirrored here. The bucket carries the text-free
	per-passage metrics, which support most reanalysis (bootstrap CIs, significance
	tests, per-passage plots) — and generation is fully seeded, so the scripts
	regenerate raw outputs bit-for-bit from the source data.
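
As one example of the reanalysis the per-passage metrics allow, the sketch below computes a paired percentile-bootstrap confidence interval for the mean per-passage CER difference between two arms. It assumes you have already loaded two equally long lists of per-passage CERs in the same passage order; the function name is hypothetical, the layout of the published metric files is not assumed here, and the numbers are synthetic.

```python
import random


def paired_bootstrap_ci(metric_a, metric_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-passage difference (a - b).

    Passages are resampled with replacement; pairing is preserved because
    we resample the per-passage differences rather than each arm separately.
    """
    if len(metric_a) != len(metric_b) or not metric_a:
        raise ValueError("need two non-empty, equally long per-passage lists")
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    n = len(diffs)
    rng = random.Random(seed)
    resampled_means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lower = resampled_means[int(alpha / 2 * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Synthetic per-passage CERs for 75 passages. These are NOT the benchmark's numbers.
rng = random.Random(42)
cer_arm_a = [0.02 + 0.02 * rng.random() for _ in range(75)]
cer_arm_b = [a + 0.005 + 0.01 * rng.random() for a in cer_arm_a]

mean_diff, lower, upper = paired_bootstrap_ci(cer_arm_a, cer_arm_b)
print(f"mean CER difference (a - b): {mean_diff:.4f}, 95% CI [{lower:.4f}, {upper:.4f}]")
```

An interval that excludes zero suggests the difference between arms is not just passage-sampling noise; with n=75 and one run per arm, it says nothing about run-to-run sampler variance.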

	## Picking up this work (humans or agents)

	Everything needed to reproduce or extend these experiments is in this bucket — no
	other repo required. The workflow, end to end:

	1. Run generation on HF Jobs — directly from this bucket, no download needed.
	Every script is self-contained (PEP 723 inline deps) and `hf jobs uv run`
	accepts a URL; bucket files resolve at
	`https://huggingface.co/buckets/davanstrien/diffusiongemma-ocr-bench/resolve/<file>`.
	(Needs an HF account with Jobs billing; `hf auth login`.)
	Always smoke first — measured costs on `a100-large` (~$2.50/h):
	```bash
	BUCKET=https://huggingface.co/buckets/davanstrien/diffusiongemma-ocr-bench/resolve

	# smoke: ~8 min wall-clock ≈ $0.35 (model download runs at multi-GB/s, ~2 min for 52 GB)
	hf jobs uv run --flavor a100-large --timeout 45m -e HF_XET_HIGH_PERFORMANCE=1 -s HF_TOKEN \
	$BUCKET/benchmark.py -- --mode smoke

	# full three-arm benchmark: ~60-80 min ≈ $3
	hf jobs uv run --flavor a100-large --timeout 3h -e HF_XET_HIGH_PERFORMANCE=1 -s HF_TOKEN \
	$BUCKET/benchmark.py -- --mode full --n 75 --models all --out-repo <your-private-dataset-repo>

	# canvas sweep screen stage (27-cell factorial + anchor, n=20): ~30-40 min ≈ $1.50
	hf jobs uv run --flavor a100-large --timeout 90m -e HF_XET_HIGH_PERFORMANCE=1 -s HF_TOKEN \
	$BUCKET/canvas_sweep.py -- --stage screen --out-repo <your-private-dataset-repo>
	```
	2. To modify a script first, copy it locally, edit, and run the local file the
	same way:
	```bash
	hf buckets cp hf://buckets/davanstrien/diffusiongemma-ocr-bench/canvas_sweep.py .
	```
	Raw outputs contain BLN600 text (CC-BY-NC) → `--out-repo` must be private.
	Monitor with `hf jobs logs <job-id>`; list with `hf jobs ps`.
	3. Compute metrics (CPU, seconds — locally, or as a URL-run cpu-basic job):
	```bash
	hf download <your-private-dataset-repo> raw_outputs.jsonl --repo-type dataset --local-dir .
	uv run metrics.py summarize --infile raw_outputs.jsonl --outdir results/
	uv run metrics.py sweep --infile raw_outputs_canvas_sweep_screen.jsonl --outdir results/
	```
	Metrics are deliberately decoupled from generation: metric changes never require
	re-running GPU jobs, and the metric files contain no copyrighted text — safe to publish.
	4. Log your experiment. Convention: one directory per experiment under
	[`experiments/`](experiments/), named `YYYY-MM-DD_slug/`, containing a README
	(question, design — ideally written before results — findings, including
	negative ones) plus the text-free metric artifacts. Add a row to the
	[experiment index](experiments/README.md). Upload with:
	```bash
	hf buckets cp -r my-experiment-dir hf://buckets/<your-bucket>/experiments/YYYY-MM-DD_slug
	```
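
The dollar figures quoted in step 1 follow from the ~$2.50/h rate. The sketch below reproduces that arithmetic and also shows the ceiling implied by each `--timeout`, assuming cost scales linearly with wall-clock time (an assumption, not a statement of Hugging Face's billing rules). The helper name is hypothetical.

```python
USD_PER_HOUR = 2.50  # approximate a100-large rate quoted in step 1


def job_cost_usd(minutes: float, usd_per_hour: float = USD_PER_HOUR) -> float:
    """Estimated cost of a job that runs for `minutes` of wall-clock time."""
    return minutes / 60 * usd_per_hour


# Typical durations and timeouts taken from the commands in step 1.
runs = {
    "smoke": {"typical_min": (8, 8), "timeout_min": 45},
    "full three-arm benchmark": {"typical_min": (60, 80), "timeout_min": 180},
    "canvas sweep (screen)": {"typical_min": (30, 40), "timeout_min": 90},
}

for name, run in runs.items():
    low, high = (job_cost_usd(m) for m in run["typical_min"])
    ceiling = job_cost_usd(run["timeout_min"])
    print(f"{name}: ${low:.2f} to ${high:.2f} typical, ${ceiling:.2f} if it runs to the timeout")
```

Checking the timeout ceiling before launching is a cheap guard against a hung job costing several times the typical figure.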

	Useful context for new experiments: DiffusionGemma's sampler knobs live in its
	`generation_config.json` (`t_max`/`t_min` temperature schedule, EntropyBoundSampler
	`entropy_bound`, `confidence_threshold`, `max_denoising_steps` 48, fixed 256-token
	canvas). The undocumented `decoder_input_ids` kwarg of
	`DiffusionGemmaForBlockDiffusion.generate()` replaces the random initial canvas —
	see `canvas_sweep.py` for a worked example, and the v1 experiment README for the
	engineering gotchas (output channel markers, `.sequences` includes the prompt,
	streamer subclassing for step counts).
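
Two of the points above (the fixed 256-token canvas and `.sequences` including the prompt) come down to simple list handling. The sketch below illustrates both on plain Python lists of token ids. It is not taken from `benchmark.py` or `canvas_sweep.py`: the function names, the `fill_id` padding choice, and the truncation policy are all assumptions, and real code would work on tensors and pass the canvas through the undocumented `decoder_input_ids` kwarg in whatever shape that hook expects.

```python
CANVAS_LEN = 256  # fixed canvas size stated above


def fit_to_canvas(token_ids, fill_id, canvas_len=CANVAS_LEN):
    """Truncate or right-pad token ids to exactly `canvas_len` entries.

    `fill_id` is a HYPOTHETICAL placeholder id for unused positions; which id
    (if any) the real scripts use there is not documented in this README.
    """
    trimmed = list(token_ids[:canvas_len])
    return trimmed + [fill_id] * (canvas_len - len(trimmed))


def strip_prompt(sequence, prompt_len):
    """Drop the leading prompt tokens from a returned sequence.

    Reflects the gotcha that the generated sequence includes the prompt,
    so only the tail after `prompt_len` tokens is the model's output.
    """
    return list(sequence[prompt_len:])


# Toy token ids, not real tokenizer output.
ocr_token_ids = [101, 102, 103]
canvas = fit_to_canvas(ocr_token_ids, fill_id=0)
print(len(canvas), canvas[:5])

prompt_ids = [7, 8, 9]
returned_sequence = prompt_ids + [42, 43]
print(strip_prompt(returned_sequence, len(prompt_ids)))
```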

	By [davanstrien](https://huggingface.co/davanstrien).
