A3-Doc
A3-Doc is a document and chart understanding specialist in the Schneewolf Labs A-series: a single-domain Stage-2 full fine-tune of A3 on ChartDocMix-v1. Where A3-Instruct is the generalist sibling, A3-Doc trades breadth for depth on the ChartQA / DocVQA / InfoVQA / TextVQA / OCRBench class of tasks.
What it is
| Property | Value |
|---|---|
| Architecture | Qwen3-VL ViT (frozen, ~0.41 B) + 2-layer MLP projector (trained) + A2/Mistral decoder (full FFT) |
| Total params | 12.69 B (12.28 B trainable in Stage-2; ViT frozen) |
| Base | schneewolflabs/A3 |
| Training corpus | schneewolflabs/ChartDocMix-v1 (241,435 rows: ~96% doc/chart/OCR VQA + 4% identity rehearsal) |
| Epochs | 1 (15,075 steps) |
| Effective batch | 16 (bs 1 × grad-accum 16) |
| Optimizer | paged AdamW 8-bit |
| Learning rate | 1e-5, cosine, warmup 3% |
| Max seq length | 2048 |
| Vision token cap | max_pixels = 512×512 (262 K px) — see the resolution note below |
| Hardware | 1× NVIDIA GB10 (DGX Spark, 128 GB unified) |
| Wall-clock | ~3.3 days |
| Final eval loss | 0.499 (down from 0.647 at the first eval) |
The single-domain corpus is far more learnable than the generalist mix: A3-Doc reaches eval loss 0.499, well under A3-Instruct's 0.752 on the broad corpus.
Benchmarks
Greedy decoding, lmms-eval terse-answer prompt convention, 500-row validation slices (a fast read — see caveats). Metrics: ChartQA relaxed accuracy, DocVQA / InfoVQA ANLS, TextVQA VQA-accuracy, OCRBench contains-accuracy.
| Benchmark | A3-Doc | Metric |
|---|---|---|
| ChartQA | 53.2 | relaxed acc |
| DocVQA | 48.4 | ANLS |
| InfoVQA | 34.2 | ANLS |
| TextVQA | 71.6 | VQA acc |
| OCRBench | 67.0 (670/1000) | contains |
For a Path-B graft trained on 241 K rows, TextVQA and OCRBench are genuinely respectable — scene-text and OCR transferred well. DocVQA/InfoVQA are the weak spots, and the reason is known (below).
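For readers unfamiliar with the metrics in the table, here is a minimal sketch of per-question ANLS and ChartQA-style relaxed matching as they are commonly defined. It is an illustration, not the lmms-eval implementation: the lowercase/strip normalization, the 0.5 ANLS threshold, and the 5% numeric tolerance are the conventional choices and are assumed here, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(prediction: str, references: list[str], threshold: float = 0.5) -> float:
    """Per-question ANLS: best similarity over references, zeroed at the threshold."""
    pred = prediction.strip().lower()  # assumed normalization
    best = 0.0
    for ref in references:
        ref_norm = ref.strip().lower()
        longest = max(len(pred), len(ref_norm))
        nld = levenshtein(pred, ref_norm) / longest if longest else 0.0
        score = 1.0 - nld if nld < threshold else 0.0
        best = max(best, score)
    return best


def relaxed_match(prediction: str, reference: str, tolerance: float = 0.05) -> bool:
    """Relaxed match: numbers within the tolerance, otherwise exact string match."""
    def to_float(text: str):
        try:
            return float(text.strip().rstrip("%").replace(",", ""))
        except ValueError:
            return None

    p, r = to_float(prediction), to_float(reference)
    if p is not None and r is not None:
        if r == 0:
            return p == 0
        return abs(p - r) / abs(r) <= tolerance
    return prediction.strip().lower() == reference.strip().lower()
```

A benchmark score is then the mean of these per-question values over the slice, multiplied by 100 where the table reports percentages.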
Caveats: numbers are from 500-row slices, not full splits (the OCRBench row is
the exception, reported over 1,000 items). ChartQA's test split interleaves
human_test (harder) and augmented_test (easier), and the published number
averages both, so a flat 500-row sample may over-represent one type. Treat
these as indicative, not leaderboard-final.
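One way to keep a small slice from over-representing either ChartQA subset is to sample each subset in proportion to its size. The sketch below is not what was used for the numbers above; the `subset` field name and the `stratified_sample` helper are hypothetical, and the rows are toy data.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, total, seed=0):
    """Draw about `total` rows, preserving each group's share of the full set."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    rng = random.Random(seed)  # fixed seed so the slice is reproducible
    sample = []
    for name in sorted(groups):
        members = groups[name]
        # Quota is proportional to the group's size; rounding may shift the total by a row or two.
        quota = round(total * len(members) / len(rows))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


# Toy rows standing in for an evaluation split with two equally sized subsets.
rows = [{"id": i, "subset": "human_test"} for i in range(1250)]
rows += [{"id": i, "subset": "augmented_test"} for i in range(1250, 2500)]

picked = stratified_sample(rows, key="subset", total=500)
counts = {name: sum(r["subset"] == name for r in picked)
          for name in ("human_test", "augmented_test")}
print(counts)
```

Reporting the per-subset scores alongside the average also makes a slice easier to compare with published figures.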
The resolution finding (important)
A3-Doc was trained and evaluated at max_pixels = 512×512. DocVQA and
InfoVQA are high-resolution document scans where text is tiny, so at 512² much
of the text is illegible. This is the dominant limiter on those two tasks.
Diagnostic — eval-only, no retraining, same 200 rows:
| Benchmark | @512² (262 K px) | @1280² (1.64 M px) | Δ (ANLS × 100) |
|---|---|---|---|
| DocVQA (ANLS) | 0.525 | 0.580 | +5.5 |
| InfoVQA (ANLS) | 0.385 | 0.420 | +3.5 |
The frozen ViT + projector + decoder generalize to higher visual-token
counts despite seeing only 512² in training. The eval-only gain is probably a
floor: a retrain at higher max_pixels should beat it, though that is untested.
If you run A3-Doc yourself, raise max_pixels (the ArtemisVLMProcessor accepts
it) for document tasks. It costs more tokens and latency but helps.
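To get a feel for the token cost of a larger pixel budget, the rough arithmetic can be sketched as follows. The 16-pixel patch size and 2×2 patch merge are assumptions about the Qwen3-VL vision tower, not values stated in this card; check `cfg.vision_config` for the real ones. The `vision_tokens` helper is hypothetical.

```python
def vision_tokens(max_pixels: int, patch_size: int = 16, merge_size: int = 2) -> int:
    """Approximate upper bound on merged vision tokens for a given pixel budget."""
    # ASSUMPTION: each token covers a (patch_size * merge_size)-pixel square.
    pixels_per_token = (patch_size * merge_size) ** 2
    return max_pixels // pixels_per_token


for side in (512, 1024, 1280):
    print(f"{side}x{side}: ~{vision_tokens(side * side)} vision tokens")
```

Under these assumptions the budget grows from about 256 tokens at 512² to about 1,600 at 1280², which is most of the 2,048-token sequence length used in training and explains the extra latency.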
Intended use
Document & chart VQA, infographic QA, OCR-style reading, chart captioning. For broad conversation/creative use reach for A3-Instruct; for dense image captioning reach for A3.
Inference
from transformers import AutoConfig, AutoTokenizer
from artemis_vlm import ArtemisVLMForConditionalGeneration, ArtemisVLMProcessor
import torch
ckpt = "schneewolflabs/A3-Doc"
model = ArtemisVLMForConditionalGeneration.from_pretrained(ckpt, dtype=torch.bfloat16).to("cuda")
cfg = AutoConfig.from_pretrained(ckpt, trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
# raise max_pixels for document tasks (training default was 512*512):
proc = ArtemisVLMProcessor(tokenizer=tok, vision_config=cfg.vision_config,
                           max_pixels=1280*1280)
Also runs in llama.cpp via the Schneewolf-Labs/llama.cpp fork's Artemis VLM
mmproj graft (same pattern as A3 / A3-Instruct).
Roadmap — A3-Doc-v2
The resolution finding points to the obvious next lever: retrain at 1024²–1280² max_pixels rather than 512². Same corpus, same recipe, higher vision budget. Expected to push DocVQA/InfoVQA well past the eval-only gains.
Lineage
- schneewolflabs/A3: Stage-1 base (projector-only alignment)
- schneewolflabs/A2: text decoder (Mistral 12.3 B)
- schneewolflabs/ChartDocMix-v1: training corpus
- schneewolflabs/i-DPO: identity/voice anti-drift bedrock
License
apache-2.0, consistent with the rest of the A-series lineage. Constituent training sources carry their own licenses (see the ChartDocMix-v1 card).