A3-Doc

A3-Doc is the document and chart understanding specialist in the Schneewolf Labs A-series: a focused-excellence Stage-2 full fine-tune of A3 on ChartDocMix-v1. Where A3-Instruct is the generalist sibling, A3-Doc trades breadth for depth on the ChartQA / DocVQA / InfoVQA / TextVQA / OCRBench class of tasks.

What it is


Architecture	Qwen3-VL ViT (frozen, ~0.41 B) + 2-layer MLP projector (trained) + A2/Mistral decoder (full FFT)
Total params	12.69 B (12.28 B trainable in Stage-2; ViT frozen)
Base	`schneewolflabs/A3`
Training corpus	`schneewolflabs/ChartDocMix-v1` (241,435 rows: ~96% doc/chart/OCR VQA + 4% identity rehearsal)
Epochs	1 (15,075 steps)
Effective batch	16 (bs 1 × grad-accum 16)
Optimizer	paged AdamW 8-bit
Learning rate	1e-5, cosine, warmup 3%
Max seq length	2048
Vision token cap	max_pixels = 512×512 (262 K px) — see the resolution note below
Hardware	1× NVIDIA GB10 (DGX Spark, 128 GB unified)
Wall-clock	~3.3 days
Final eval loss	0.499 (down from 0.647 at the first eval)
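
The figures in the table can be cross-checked with a few lines of arithmetic. This is an illustrative sketch, not the training code: `lr_at` is a hypothetical helper, and linear warmup followed by cosine decay to zero is only the usual reading of "cosine, warmup 3%", so treat that shape as an assumption.

```python
import math

# Values copied from the table above.
TOTAL_PARAMS_B = 12.69
FROZEN_VIT_B = 0.41          # approximate size of the frozen ViT
ROWS = 241_435
EFFECTIVE_BATCH = 1 * 16     # batch size 1 x grad-accum 16
REPORTED_STEPS = 15_075
PEAK_LR = 1e-5
WARMUP_FRACTION = 0.03

# Trainable share = everything except the frozen ViT.
trainable_b = round(TOTAL_PARAMS_B - FROZEN_VIT_B, 2)

# Optimizer steps implied by one pass over every row.
steps_from_rows = math.ceil(ROWS / EFFECTIVE_BATCH)

# Warmup length implied by the reported step count.
warmup_steps = int(REPORTED_STEPS * WARMUP_FRACTION)


def lr_at(step, total=REPORTED_STEPS, warmup=warmup_steps, peak=PEAK_LR):
    """Assumed schedule: linear warmup to peak, then cosine decay to zero."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))


print(trainable_b, steps_from_rows, warmup_steps, lr_at(warmup_steps))
```

The row count implies 15,090 steps, slightly more than the reported 15,075. The card does not explain the gap; a small held-out eval split or a few dropped rows would be consistent with it.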

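The single-domain corpus is far easier to fit than the generalist mix: A3-Doc reaches an eval loss of 0.499, well under A3-Instruct's 0.752 on its broad corpus. The two losses are measured on different eval sets, so the gap reflects corpus difficulty rather than model quality.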

Benchmarks

Greedy decoding, lmms-eval terse-answer prompt convention, 500-row validation slices (a fast read — see caveats). Metrics: ChartQA relaxed accuracy, DocVQA / InfoVQA ANLS, TextVQA VQA-accuracy, OCRBench contains-accuracy.

Benchmark	A3-Doc	Metric
ChartQA	53.2	relaxed acc
DocVQA	48.4	ANLS
InfoVQA	34.2	ANLS
TextVQA	71.6	VQA acc
OCRBench	67.0 (670/1000)	contains
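
For readers unfamiliar with the metrics, here is a minimal Python sketch of ANLS, ChartQA-style relaxed accuracy, and contains-accuracy. The function names are hypothetical and the normalisation (lower-casing, whitespace stripping, percent handling) is simplified, so scores may differ slightly from the lmms-eval implementations used for the table.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using a two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def anls(prediction: str, references: list[str], threshold: float = 0.5) -> float:
    """Best score over references: 1 - normalised distance, or 0 at/above threshold."""
    pred = prediction.strip().lower()
    best = 0.0
    for ref in references:
        ref = ref.strip().lower()
        denom = max(len(pred), len(ref))
        nl = levenshtein(pred, ref) / denom if denom else 0.0
        best = max(best, 1.0 - nl if nl < threshold else 0.0)
    return best


def _to_float(text: str):
    """Parse a number, treating a trailing % as a fraction; None if not numeric."""
    text = text.strip()
    try:
        if text.endswith("%"):
            return float(text[:-1]) / 100.0
        return float(text)
    except ValueError:
        return None


def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Numeric answers may deviate by up to 5%; everything else must match exactly."""
    p, t = _to_float(prediction), _to_float(target)
    if p is not None and t is not None:
        if t == 0:
            return p == t
        return abs(p - t) / abs(t) <= tolerance
    return prediction.strip().lower() == target.strip().lower()


def contains_match(prediction: str, references: list[str]) -> bool:
    """True if any reference string appears inside the prediction."""
    pred = prediction.lower()
    return any(ref.strip().lower() in pred for ref in references)
```

ANLS gives partial credit for near-miss transcriptions, which is why it is sensitive to illegible text: a few wrong characters lower the score, and a distance of half the string or more zeroes it.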

For a Path-B graft trained on 241 K rows, TextVQA (71.6) and OCRBench (67.0) are the strongest results: scene-text reading and OCR transferred well. DocVQA (48.4) and InfoVQA (34.2) are the weak spots, and the main cause is the 512×512 input resolution, covered in the resolution section below.

Caveats: the numbers come from 500-row slices, not full splits. ChartQA's test split interleaves human_test (harder) and augmented_test (easier), and the published number averages both, so a flat 500-row sample may over-represent one type. Treat these as indicative, not leaderboard-final.
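
One way to avoid over-representing a ChartQA subset is to sample the slice per subset rather than flat. The sketch below is not how the numbers above were produced; `stratified_slice` and the `subset` field are hypothetical names, and the rows are toy data.

```python
import random


def stratified_slice(rows, key, n_total, seed=0):
    """Draw n_total rows with an equal share from each group defined by key(row)."""
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    rng = random.Random(seed)
    per_group, remainder = divmod(n_total, len(groups))
    sample = []
    for i, name in enumerate(sorted(groups)):
        # Spread any remainder over the first groups in sorted order.
        quota = per_group + (1 if i < remainder else 0)
        sample.extend(rng.sample(groups[name], min(quota, len(groups[name]))))
    return sample


# Toy stand-in for a test split with two interleaved subsets.
toy_rows = [{"id": i, "subset": "human_test" if i % 2 else "augmented_test"}
            for i in range(2000)]
slice_500 = stratified_slice(toy_rows, key=lambda r: r["subset"], n_total=500)
n_human = sum(r["subset"] == "human_test" for r in slice_500)
print(len(slice_500), n_human)
```

With equal quotas, the slice average weights the two subsets equally instead of depending on how a flat sample happened to fall.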

The resolution finding (important)

A3-Doc was trained and evaluated at max_pixels = 512×512. DocVQA and InfoVQA are high-resolution document scans where text is tiny, so at 512² much of the text is illegible. This is the dominant limiter on those two tasks.
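
To see how much a pixel cap shrinks a page, the sketch below scales an image to fit a `max_pixels` budget while keeping its aspect ratio. It is a simplification: `fit_to_max_pixels` is a hypothetical helper, the 1700×2200 scan (a US Letter page at 200 dpi) is an illustrative input, and real processors typically also round to a multiple of the patch size.

```python
import math


def fit_to_max_pixels(width, height, max_pixels):
    """Return (new_width, new_height, scale) so the area stays within max_pixels."""
    area = width * height
    if area <= max_pixels:
        return width, height, 1.0
    scale = math.sqrt(max_pixels / area)
    return math.floor(width * scale), math.floor(height * scale), scale


page_w, page_h = 1700, 2200   # illustrative document scan
glyph_px = 20                 # illustrative text height in the original scan

for side in (512, 1280):
    w, h, s = fit_to_max_pixels(page_w, page_h, side * side)
    print(f"max_pixels={side}^2 -> {w}x{h}, scale {s:.2f}, glyph ~{glyph_px * s:.1f}px")
```

Under these assumptions the page is scaled to roughly a quarter of its linear size at 512², so 20-pixel text ends up about 5 pixels tall. At 1280² the scale is roughly two thirds.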

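Diagnostic (eval-only, no retraining, the same 200 rows at both resolutions, so the 512² column is not directly comparable to the 500-row scores above):

Benchmark	@512² (262 K px)	@1280² (1.64 M px)	Δ (ANLS points, ×100)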
DocVQA (ANLS)	0.525	0.580	+5.5
InfoVQA (ANLS)	0.385	0.420	+3.5

The frozen ViT + projector + decoder generalize to higher visual-token counts despite only seeing 512² in training. The eval-only gain is a floor; a retrain at higher max_pixels should beat it. If you run A3-Doc yourself, raise max_pixels (the ArtemisVLMProcessor accepts it) for document tasks — it costs more tokens/latency but helps.
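
The extra cost of a higher `max_pixels` can be estimated from the visual-token count. The sketch below assumes each visual token covers a 32×32 pixel area (a 16-pixel patch with 2×2 merging). That figure is an assumption, not something this card states, so check the checkpoint's vision config. The ratio between the two budgets does not depend on it.

```python
def visual_tokens(max_pixels, pixels_per_token_side=32):
    """Upper bound on visual tokens for a pixel budget (assumed token footprint)."""
    return max_pixels // (pixels_per_token_side ** 2)


tokens_512 = visual_tokens(512 * 512)
tokens_1280 = visual_tokens(1280 * 1280)
ratio = tokens_1280 / tokens_512

print(tokens_512, tokens_1280, ratio)
```

Whatever the true token footprint, moving from 512² to 1280² multiplies the visual-token budget by 6.25, which is where the extra latency comes from.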

Intended use

Document & chart VQA, infographic QA, OCR-style reading, chart captioning. For broad conversation/creative use reach for A3-Instruct; for dense image captioning reach for A3.

Inference

import torch
from transformers import AutoConfig, AutoTokenizer
from artemis_vlm import ArtemisVLMForConditionalGeneration, ArtemisVLMProcessor

ckpt = "schneewolflabs/A3-Doc"

# Load the weights in bfloat16 and move the model to the GPU.
model = ArtemisVLMForConditionalGeneration.from_pretrained(ckpt, dtype=torch.bfloat16).to("cuda")

# The config supplies the vision settings the processor needs.
cfg = AutoConfig.from_pretrained(ckpt, trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)

# Raise max_pixels for document tasks (the training default was 512*512).
proc = ArtemisVLMProcessor(tokenizer=tok, vision_config=cfg.vision_config,
                           max_pixels=1280*1280)

Also runs in llama.cpp via the Schneewolf-Labs/llama.cpp fork's Artemis VLM mmproj graft (same pattern as A3 / A3-Instruct).

Roadmap — A3-Doc-v2

The resolution finding points to the obvious next lever: retrain at a max_pixels of 1024² to 1280² rather than 512², with the same corpus and recipe but a higher vision budget. This is expected to push DocVQA/InfoVQA well past the eval-only gains.

Lineage

schneewolflabs/A3 — Stage-1 base (projector-only alignment)
schneewolflabs/A2 — text decoder (Mistral 12.3 B)
schneewolflabs/ChartDocMix-v1 — training corpus
schneewolflabs/i-DPO — identity/voice anti-drift bedrock

License

apache-2.0, consistent with the rest of the A-series lineage. Constituent training sources carry their own licenses (see the ChartDocMix-v1 card).
