# Experiment log

One directory per experiment, newest last. Each has a README (design, config,
findings) plus any publishable artifacts (metrics, summaries; never raw BLN600
text, which is CC-BY-NC).

| Date | Experiment | One-line result |
|---|---|---|
| 2026-06-10 | [v1 text benchmark](2026-06-10_v1-bln600-text/): DiffusionGemma vs Gemma-4-E4B, 75 BLN600 passages | Diffusion wins on CER (0.036 vs 0.042) and is ~8.5× faster; OCR-seeded canvas collapses to copy-through |
| 2026-06-11 | [image-input vibe check](2026-06-11_image-vibe/): does the source page image help correction? | Grounding is weak and can manufacture false confidence at low resolution; image-only OCR is weak; parked |
| 2026-06-11 | [canvas-rescue sweep](2026-06-11_canvas-rescue/): can t_max / entropy_bound / canvas noise make the OCR-seeded canvas edit instead of copy? (pre-registered) | **Negative.** Knobs break the copy-through (steps 3→33) but editing never becomes correcting: best cell CER 0.063 ≈ doing nothing (0.064), far behind random-canvas 0.030, and slower. Needs training-time support |
| 2026-06-11 | [MoE baseline](2026-06-11_moe-baseline/): gemma-4-26B-A4B-it, the parameter-matched AR twin (per João Gante) | **Quality headline flips:** MoE wins CER 0.027 vs 0.035, but DiffusionGemma is ~10× faster at equal capacity. v1 numbers reproduce |

## Next steps (logged, not yet committed)

**Scaled-up v2 eval: data identified, extension decision pending.** A survey of
post-OCR eval sources found two easy-to-grab additions to full BLN600 (n=600):

- **Overproof datasets 2+3** ([overproof.projectcomputing.com/datasets](https://overproof.projectcomputing.com/datasets/)):
  208 hand-corrected newspaper articles (Sydney Morning Herald 1842–1954 via Trove;
  Chronicling America 1871–1921). Plain-HTTP download, line-aligned OCR‖gold.
  Different collections *and* OCR pipelines than BLN600 → generalization test.
  No formal license (sources are public domain): eval fine, redistribution needs a
  permission ask. Dataset 3's source pages carry ABBYY ALTO per-word/char
  confidences, the bridge to a confidence-guided correction experiment.
- **NCSE transcribed articles** ([DOI 10.5522/04/25805008.v1](https://rdr.ucl.ac.uk/articles/dataset/Transcribed_newspaper_articles_from_the_NCSE_collection/25805008)):
  91 pairs, 40.7k words, 19th-c periodicals, much noisier OCR than BLN600, **CC0**,
  purpose-made human gold with published CLOCR-C LLM baselines. Single small zip.

Also on the list: bootstrap confidence intervals in metrics.py (works retroactively
on any outputs file) and multiple sampler seeds for the diffusion arm.

Rejected as eval gold after the survey: PleIAs/Post-OCR-Correction,
ChroniclingAmericaQA, Scrambled Text (all model-generated "gold", training
material only), RETAS (Gutenberg-aligned, not page-faithful), and NOD (synthetic
noise). ICDAR 2017 EN is held in reserve (Google Drive distribution, bespoke
license, needs dedup vs ICDAR 2019).
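
As a rough illustration of the bootstrap item above, a percentile bootstrap over passages might look like the sketch below. It is not the contents of metrics.py: the function names, the `(hypothesis, gold)` pair layout, and the choice of micro-averaged CER (total edits over total gold characters) are all assumptions made for this example.

```python
import random


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance, keeping only two DP rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def corpus_cer(pairs):
    """Micro-averaged CER (assumed convention): total edits / total gold chars."""
    edits = sum(edit_distance(hyp, gold) for hyp, gold in pairs)
    chars = sum(len(gold) for _, gold in pairs)
    return edits / chars


def bootstrap_ci(pairs, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for corpus CER, resampling whole passages."""
    rng = random.Random(seed)
    # Per-passage (edits, gold length), computed once so resampling is cheap.
    stats = [(edit_distance(hyp, gold), len(gold)) for hyp, gold in pairs]
    estimates = []
    for _ in range(n_resamples):
        sample = rng.choices(stats, k=len(stats))  # draw with replacement
        estimates.append(sum(e for e, _ in sample) / sum(c for _, c in sample))
    estimates.sort()
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

Because the function only needs hypothesis and gold strings, it could be run after the fact on any saved outputs file once that file is parsed into pairs. The sketch assumes every gold passage is non-empty. With multiple sampler seeds for the diffusion arm, one option would be to compute the interval per seed and report the spread alongside it.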