davanstrien/diffusiongemma-ocr-bench / experiments /README.md

davanstrien

5 days ago


	# Experiment log

	One directory per experiment, newest last. Each has a README (design, config,
	findings) plus any publishable artifacts (metrics, summaries — never raw BLN600
	text, which is CC-BY-NC).

	| Date | Experiment | One-line result |
	|---|---|---|
	| 2026-06-10 | [v1 text benchmark](2026-06-10_v1-bln600-text/): DiffusionGemma vs Gemma-4-E4B, 75 BLN600 passages | Diffusion wins on CER (0.036 vs 0.042) and is ~8.5× faster; OCR-seeded canvas collapses to copy-through |
	| 2026-06-11 | [image-input vibe check](2026-06-11_image-vibe/): does the source page image help correction? | Grounding is weak and can manufacture false confidence at low resolution; image-only OCR is weak; parked |
	| 2026-06-11 | [canvas-rescue sweep](2026-06-11_canvas-rescue/): can t_max / entropy_bound / canvas noise make the OCR-seeded canvas edit instead of copy? (pre-registered) | Negative. Knobs break the copy-through (steps 3→33) but editing never becomes correcting: best cell CER 0.063 ≈ doing nothing (0.064), far behind random-canvas 0.030, and slower. Needs training-time support |
	| 2026-06-11 | [MoE baseline](2026-06-11_moe-baseline/): gemma-4-26B-A4B-it, the parameter-matched AR twin (per João Gante) | Quality headline flips: MoE wins CER 0.027 vs 0.035, but DiffusionGemma is ~10× faster at equal capacity. v1 numbers reproduce |
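
For readers unfamiliar with the metric, the CER figures in the table can be read as character-level edit distance divided by reference length. The sketch below is a minimal illustration, not the repository's `metrics.py`: the function names are hypothetical, and whether the benchmark normalizes text or averages per passage rather than over the whole corpus is an assumption left open here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete, and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of hypothesis against a non-empty gold reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


# One wrong character in a seven-character reference.
print(cer("tbe cat", "the cat"))
```

Under this definition a model that leaves the OCR untouched scores the OCR's own CER, which is the "doing nothing" baseline the canvas-rescue row compares against.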

	## Next steps (logged, not yet committed)

	Scaled-up v2 eval — data identified, extension decision pending. A survey of
	post-OCR eval sources found two easy-to-grab additions to full BLN600 (n=600):

	- Overproof datasets 2+3 ([overproof.projectcomputing.com/datasets](https://overproof.projectcomputing.com/datasets/)) —
	208 hand-corrected newspaper articles (Sydney Morning Herald 1842–1954 via Trove;
	Chronicling America 1871–1921). Plain-HTTP download, line-aligned OCR‖gold.
	Different collections and OCR pipelines than BLN600 → generalization test.
	No formal license (sources are public domain): eval fine, redistribution needs a
	permission ask. Dataset 3's source pages carry ABBYY ALTO per-word/char
	confidences — the bridge to a confidence-guided correction experiment.
	- NCSE transcribed articles ([DOI 10.5522/04/25805008.v1](https://rdr.ucl.ac.uk/articles/dataset/Transcribed_newspaper_articles_from_the_NCSE_collection/25805008)) —
	91 pairs, 40.7k words, 19th-c periodicals, much noisier OCR than BLN600, CC0,
	purpose-made human gold with published CLOCR-C LLM baselines. Single small zip.
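
The per-word confidences mentioned for Overproof dataset 3 could be pulled out of ALTO along the following lines. This is a sketch under assumptions: ALTO records word confidence in the `WC` attribute of `String` elements as a float between 0 and 1, but the sample document, the 0.8 threshold, and the function name are illustrative, and the actual source pages have not been checked against this layout.

```python
import xml.etree.ElementTree as ET

# Illustrative ALTO fragment, not taken from any real page.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Sydney" WC="0.97"/>
    <String CONTENT="Mornlng" WC="0.41"/>
    <String CONTENT="Herald" WC="0.93"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def low_confidence_words(alto_xml: str, threshold: float = 0.8):
    """Return (word, confidence) pairs whose WC falls below the threshold."""
    root = ET.fromstring(alto_xml)
    flagged = []
    for node in root.iter():
        # Strip the namespace so any ALTO schema version matches.
        if node.tag.rsplit("}", 1)[-1] != "String":
            continue
        wc = node.get("WC")
        if wc is None:
            continue  # some producers omit confidences
        if float(wc) < threshold:
            flagged.append((node.get("CONTENT", ""), float(wc)))
    return flagged


print(low_confidence_words(SAMPLE))
```

A confidence-guided experiment could use such a list to decide which spans the corrector is allowed to change, though that design is not specified in this log.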

	Also on the list: bootstrap confidence intervals in metrics.py (works retroactively
	on any outputs file) and multiple sampler seeds for the diffusion arm.

	Rejected as eval gold after the survey: PleIAs/Post-OCR-Correction, ChroniclingAmericaQA,
	and Scrambled Text (all model-generated "gold", so training material only); RETAS
	(Gutenberg-aligned, not page-faithful); and NOD (synthetic noise). ICDAR 2017 EN is
	held in reserve rather than rejected (Google Drive distribution, bespoke license,
	needs dedup vs ICDAR 2019).
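
The bootstrap confidence intervals planned for metrics.py could take roughly the following shape. This is a hedged sketch of a plain percentile bootstrap over per-passage CER values, using only the standard library; the function name, the resample count, and the idea that an outputs file reduces to one CER per passage are assumptions, not the repository's design.

```python
import random
from statistics import mean


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-passage scores."""
    if not values:
        raise ValueError("need at least one value")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(values)
    # Resample passages with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = min(n_resamples - 1, int(n_resamples * (1 - alpha / 2)))
    return means[lo_idx], means[hi_idx]


# Made-up per-passage CERs, for illustration only.
per_passage_cer = [0.01, 0.02, 0.05, 0.03, 0.04, 0.06, 0.02, 0.03]
print(bootstrap_mean_ci(per_passage_cer))
```

Because the function only needs a list of per-passage scores, it could be applied to existing outputs without rerunning any model. Comparing two systems on the same passages would call for resampling passage indices once per draw and taking the difference, which this sketch does not do.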
