A3

Project Artemis, Stage-1. A3 is the projector-aligned successor to A3-preview: same architecture, 40× more training data, ~10× more optimizer steps, and markedly sharper image grounding. A3 produces detailed descriptive captions on diverse images. It is not an instruction-tuned VLM: VQA, OCR-as-task, multi-turn visual reasoning, and tool calling on images are the job of Artemis (Stage-2), which builds on A3.

What this is

The same LLaVA-style graft as A3-preview:

| Component | Source | Role |
| --- | --- | --- |
| Vision tower | Qwen/Qwen3-VL-2B-Instruct (ViT, ~600M params) | Image → visual feature tokens |
| Projector | Trained 2-layer MLP, ~45M params | Visual hidden → text hidden bridge |
| Language model | schneewolflabs/A2 (~12B params) | Unchanged decoder |

Only the projector is trained. The vision tower and decoder are frozen exactly as published, so A2's reasoning, tool calling, identity, and Qwen3 chat template behavior are byte-identical to text-only A2 by construction.
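
To put "only the projector is trained" in numbers, here is a small arithmetic sketch using the approximate parameter counts from the component table. The counts are the rounded figures quoted above, not exact tensor sizes, so the result is an estimate.

```python
# Approximate parameter counts from the component table (rounded figures).
PARAMS = {
    "vision_tower": 600e6,    # frozen
    "projector": 45e6,        # trained
    "language_model": 12e9,   # frozen
}
TRAINABLE = {"projector"}


def trainable_fraction(params, trainable):
    """Share of all parameters that receive gradient updates."""
    total = sum(params.values())
    trained = sum(n for name, n in params.items() if name in trainable)
    return trained / total


frac = trainable_fraction(PARAMS, TRAINABLE)
total_b = sum(PARAMS.values()) / 1e9
print(f"trainable share: {frac:.2%} of roughly {total_b:.1f}B parameters")
```

Under these rounded counts, well under 1% of the combined model is updated during Stage-1.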

Training details

| Setting | Value |
| --- | --- |
| Corpus | BLIP3o/BLIP3o-Pretrain-Long-Caption (1,000,000 streamed samples) |
| Optimizer | AdamW (fp32 moments), lr 1e-3 cosine to 0, 5% warmup |
| Effective batch | 32 (bs=2 × grad_accum=16) |
| Steps | 31,219 (1 epoch, ~0.5% held out for eval) |
| Precision | bfloat16 |
| Hardware | single NVIDIA GB10 (DGX Spark) |
| Wall clock | 130h 54m ≈ 5.45 days |
| Train loss | 5.55 → 0.65 |
| Eval loss | 5.55 → 0.60 on held-out BLIP3o |
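
The learning-rate schedule in the table (peak 1e-3, 5% warmup, cosine decay to 0 over 31,219 steps) can be sketched as a plain function. The linear warmup shape and the per-optimizer-step update are assumptions here; the table does not say how the warmup ramps or how the scheduler was stepped.

```python
import math

PEAK_LR = 1e-3
TOTAL_STEPS = 31_219
WARMUP_STEPS = int(0.05 * TOTAL_STEPS)  # 5% of the run


def lr_at(step, peak=PEAK_LR, total=TOTAL_STEPS, warmup=WARMUP_STEPS):
    """Linear warmup to `peak` (assumed shape), then cosine decay to 0."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


# Print the schedule at a few points across the run.
for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>6}: lr = {lr_at(step):.6f}")
```

With these settings the warmup ends at step 1,560, which is also the first logged eval point after step 0.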

Eval curve (held-out subset)

The full power-law decay across the run:

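| Step | % of run | eval/loss | Note |
| --- | --- | --- | --- |
| 0 | 0% | 5.5491 | |
| 1,560 | 5% | 0.7823 | matched A3-preview's final score in 1/10th the steps |
| 6,240 | 20% | 0.6787 | |
| 15,600 | 50% | 0.6248 | |
| 23,400 | 75% | 0.6027 | |
| 28,080 | 90% | 0.6007 | |
| 31,219 | 100% | 0.6006 | |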

The eval loss flattens out in the cosine tail (0.6027 → 0.6006 over the final 25% of steps), so the projector has converged for this LR schedule.
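
One way to see the flattening is to compute how much eval loss drops per 1,000 steps between consecutive logged points. This is a small sketch over the values in the eval table; the helper name is illustrative and not part of the training code.

```python
# (step, eval/loss) pairs copied from the eval table.
EVAL_CURVE = [
    (0, 5.5491), (1_560, 0.7823), (6_240, 0.6787), (15_600, 0.6248),
    (23_400, 0.6027), (28_080, 0.6007), (31_219, 0.6006),
]


def loss_drop_per_1k_steps(curve):
    """Average eval-loss decrease per 1,000 steps for each logged interval."""
    rows = []
    for (s0, l0), (s1, l1) in zip(curve, curve[1:]):
        rows.append((s0, s1, (l0 - l1) / (s1 - s0) * 1000))
    return rows


rates = loss_drop_per_1k_steps(EVAL_CURVE)
for start, end, rate in rates:
    print(f"steps {start:>6} to {end:>6}: {rate:.6f} loss per 1k steps")
```

The rate shrinks in every interval, and in the last one it is close to the resolution of the logged values.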

Hold-out comparison vs A3-preview

A3 vs A3-preview on the same 4 held-out images (2 BLIP3o + 2 OOD Japanese photos, deterministic generation, max_new_tokens=150). A3 wins decisively on 3/4, ties on 1.

1. BLIP3o — abandoned lighthouse

Ground truth: "weathered, abandoned lighthouse... peeling paint and rust"
A3-preview: "old, weathered structure... single-story with flat roof and ladder"
A3: "old, weathered lighthouse... peeling paint and visible rust"

2. BLIP3o — historic Indian fort wall

Ground truth: "fort wall with arrow slits... India"
A3-preview: "weathered stone wall with rectangular openings... India"
A3: "historic fort wall... reddish-brown stone... for defensive purposes... blue roofs"

3. OOD — bar scene with cocktails

Ground truth: "Charles Vanot Curaçao Bleu liqueur, Panda gin, treble clef stirrer"
A3-preview: '"Charles Vian"... blue bottle' (missed Panda entirely)
A3: 'Charles Vannier, Panda, musical note stirrer'

4. OOD — Gundam statue at LaLaport

Ground truth: "RX-93ff Nu Gundam... LaLaport Fukuoka... LED lights"
A3-preview: "Gundam statue... Lalaport mall... illuminated by spotlights"
A3: 'Gundam statue... "Lalaport"... life-sized model... overcast sky'

The pattern: A3 names the actual subject instead of describing it generically, and its reading of brand and entity text is noticeably sharper, though still imperfect (it renders "Charles Vanot" as "Charles Vannier"). On these four images, the eval-loss improvement (0.77 → 0.60) shows up as better specific-entity recognition.

What works

Descriptive captioning across diverse domains — image-grounded, names specific objects, picks up brand text on OOD inputs, identifies real-world subjects (lighthouses, fort walls, statues, products). The 12B A2 decoder gives the prose noticeably more fluency than 2B-class VLMs at the same task.

What this is not

  • Not a fully instruction-tuned VLM. Visual instruction-following, VQA ("count the cats", "what color is the second item"), OCR-as-task, multi-turn visual reasoning, and tool calling on images are not trained here. Asked to do any of those, A3 will fall back to "describe."
  • No safety / refusal tuning on visual inputs.
  • No multi-image or video — single image per turn.

These are precisely what Artemis (Stage-2) is for.

What's next

  • Artemis: Stage-2 multimodal instruction tuning on top of A3. This is a full fine-tune (FFT) of the decoder and projector with the ViT frozen, plus heavy text rehearsal (A2 tool calling, i-DPO identity, and reasoning data) to protect the underlying A2 capabilities through the visual instruction phase. It is the named flagship multimodal release.

Install

pip install 'artemis-vlm @ git+https://github.com/Schneewolf-Labs/Artemis.git@v0.1.0'

The artemis-vlm package contains the model class, processor, and data collator. On import, it registers artemis_vlm with Hugging Face AutoConfig and AutoModelForCausalLM, so from_pretrained() resolves without trust_remote_code.

Usage

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

import artemis_vlm  # importing registers ArtemisVLM with AutoConfig / AutoModel

# Load the model in bfloat16 and switch to inference mode.
# (Older transformers releases name this argument torch_dtype.)
model = AutoModelForCausalLM.from_pretrained(
    "schneewolflabs/A3", dtype=torch.bfloat16,
).to("cuda").eval()

# The processor pairs the A3 tokenizer with the vision tower's image config.
tok = AutoTokenizer.from_pretrained("schneewolflabs/A3")
processor = artemis_vlm.ArtemisVLMProcessor(
    tokenizer=tok, vision_config=model.visual.config,
    min_pixels=32 * 32, max_pixels=512 * 512,
)

# Qwen3 chat-template multimodal message
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in detail."},
]}]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

image = Image.open("your_image.jpg")
batch = processor(text=text, images=[image], return_tensors="pt").to("cuda")

# Greedy decoding (do_sample=False) keeps the caption deterministic.
with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=200, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][batch["input_ids"].shape[1]:], skip_special_tokens=True))

Architecture notes (Path B — composition not modification)

A3 grafts a Qwen3-VL vision tower onto an unmodified A2 text decoder via a learned 2-layer MLP projector. The decoder's vocabulary, weights, chat template, reasoning, tool calling, and identity are byte-identical to A2 text-only — the multimodal addition cannot regress text capability because the text computation path doesn't change.

Vision tokens enter through A2's repurposed reserved-token layout (<|image_pad|> = token id 22 in the A-series Tekken vocab — see the A1 release notes for the full token-id allocation across <think>, <tool_call>, vision, etc.). The Qwen tokenizer is never in the picture; the projector bridges hidden spaces, not token spaces.
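
As a rough illustration of how projected vision features can enter the decoder through placeholder tokens, the sketch below swaps the embedding row at each <|image_pad|> position for the next visual feature. This is the common LLaVA-style pattern written with plain Python lists; it is not taken from the artemis_vlm source, the function name is hypothetical, and every token id other than 22 is made up.

```python
IMAGE_PAD_ID = 22  # <|image_pad|> id quoted in the text above


def merge_visual_features(input_ids, text_embeds, visual_embeds, pad_id=IMAGE_PAD_ID):
    """Replace the embedding row at each placeholder position with the next
    projected visual feature, preserving order. Inputs are left unmodified."""
    positions = [i for i, tok in enumerate(input_ids) if tok == pad_id]
    if len(positions) != len(visual_embeds):
        raise ValueError(
            f"{len(positions)} placeholder tokens but {len(visual_embeds)} visual features"
        )
    merged = [list(row) for row in text_embeds]
    for pos, feat in zip(positions, visual_embeds):
        merged[pos] = list(feat)
    return merged


# Toy example: hidden size 4, two image placeholders between two text tokens.
# Every id except 22 is invented for illustration.
input_ids = [1, 22, 22, 57]
text_embeds = [[0.0] * 4 for _ in input_ids]
visual_embeds = [[1.0] * 4, [2.0] * 4]  # stands in for projector output

merged = merge_visual_features(input_ids, text_embeds, visual_embeds)
print(merged)
```

Because the swap happens at the embedding level, the visual features only need to match the decoder's hidden size; no vision-side vocabulary is involved.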

See the Artemis repo README for the full architectural breakdown.

License

Apache 2.0. Same as A1, A2, A3-preview, and the underlying Qwen3-VL vision tower and BLIP3o-Pretrain-Long-Caption corpus.

Acknowledgements

  • BLIP3o team for the Pretrain-Long-Caption corpus
  • Qwen team for the Qwen3-VL vision encoder
  • LLaVA project for the architectural template
  • Schneewolf Labs Merlina + Artemis for the training and architecture infrastructure

— Schneewolf Labs · Project Artemis · Stage-1
