Instructions to use schneewolflabs/A3 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use schneewolflabs/A3 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="schneewolflabs/A3")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

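# Load model directly
# A3 is a causal LM; importing artemis_vlm registers it with the Auto classes (see the Install section)
import artemis_vlm
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("schneewolflabs/A3", device_map="auto")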

Local Apps

vLLM

How to use schneewolflabs/A3 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "schneewolflabs/A3"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "schneewolflabs/A3",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
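
For callers who prefer Python over curl, the same request can be sketched with the standard library alone. This is a minimal sketch, not an official client: `build_chat_payload` and `chat` are hypothetical helper names introduced here, and the response is assumed to follow the usual OpenAI-compatible shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, image_url: str) -> dict:
    """Build the same JSON body that the curl example sends."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(base_url: str, payload: dict, timeout: float = 120.0) -> str:
    """POST the payload to /v1/chat/completions and return the reply text.

    Assumes an OpenAI-compatible response body.
    """
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_payload(
    "schneewolflabs/A3",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)

# With the vLLM server from the commands above running locally:
# print(chat("http://localhost:8000", payload))
```

Building the payload separately from the network call makes it easy to inspect or log the request before sending it.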

Use Docker

docker model run hf.co/schneewolflabs/A3

SGLang

How to use schneewolflabs/A3 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "schneewolflabs/A3" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "schneewolflabs/A3",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "schneewolflabs/A3" \
        --host 0.0.0.0 \
        --port 30000
# Call the server with the same curl request shown in the pip-based SGLang example above (port 30000).

Docker Model Runner
How to use schneewolflabs/A3 with Docker Model Runner:
```
docker model run hf.co/schneewolflabs/A3
```

A3

Project Artemis, Stage-1. A3 is the projector-aligned successor to A3-preview: same architecture, 40× more training data, ~10× more optimizer steps, and markedly sharper image grounding. A3 produces detailed descriptive captions on diverse images. It is not an instruction-tuned VLM: VQA, OCR-as-task, multi-turn visual reasoning, and tool calling on images are the job of Artemis (Stage-2), which builds on A3.

What this is

The same LLaVA-style graft as A3-preview:

| Component | Source | Role |
|---|---|---|
| Vision tower | `Qwen/Qwen3-VL-2B-Instruct` (ViT, ~600M params) | Image → visual feature tokens |
| Projector | Trained 2-layer MLP, ~45M params | Visual hidden → text hidden bridge |
| Language model | `schneewolflabs/A2` (~12B params) | Unchanged decoder |

Only the projector was trained. The vision tower and decoder are frozen exactly as published, so A2's reasoning, tool calling, identity, and Qwen3 chat template behavior are byte-identical to text-only A2 by construction.

Training details

| Setting | Value |
|---|---|
| Corpus | `BLIP3o/BLIP3o-Pretrain-Long-Caption` (1,000,000 streamed samples) |
| Optimizer | AdamW (fp32 moments), lr 1e-3 cosine to 0, 5% warmup |
| Effective batch | 32 (bs=2 × grad_accum=16) |
| Steps | 31,219 (1 epoch, ~0.5% held out for eval) |
| Precision | bfloat16 |
| Hardware | single NVIDIA GB10 (DGX Spark) |
| Wall clock | 130h 54m ≈ 5.45 days |
| Train loss | 5.55 → 0.65 |
| Eval loss | 5.55 → 0.60 on held-out BLIP3o |
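
The numbers in the training details can be cross-checked with a little arithmetic, and the learning-rate schedule can be sketched from its one-line description. The sketch below assumes a linear warmup followed by a half-cosine decay to zero, with the warmup length rounded to the nearest step. The source only states "lr 1e-3 cosine to 0, 5% warmup", so the exact shape and the `lr_at` helper are assumptions rather than the training code.

```python
import math

# Values taken from the training details
MICRO_BATCH = 2
GRAD_ACCUM = 16
TOTAL_STEPS = 31_219
PEAK_LR = 1e-3
WARMUP_FRACTION = 0.05
WALL_CLOCK_S = 130 * 3600 + 54 * 60  # 130h 54m in seconds

effective_batch = MICRO_BATCH * GRAD_ACCUM            # 2 x 16 = 32
samples_seen = TOTAL_STEPS * effective_batch          # 999,008, just under the 1,000,000 streamed samples
warmup_steps = round(WARMUP_FRACTION * TOTAL_STEPS)   # assumption: rounded to the nearest step
seconds_per_step = WALL_CLOCK_S / TOTAL_STEPS         # roughly 15 s per optimizer step
days = WALL_CLOCK_S / 86_400                          # roughly 5.45 days


def lr_at(step: int) -> float:
    """Assumed schedule: linear warmup to PEAK_LR, then half-cosine decay to 0."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (TOTAL_STEPS - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


for fraction in (0.05, 0.20, 0.50, 0.75, 0.90, 1.00):
    step = round(fraction * TOTAL_STEPS)
    print(f"{fraction:>4.0%} of run (step {step:>6}): lr = {lr_at(step):.2e}")
```

Under these assumptions the learning rate at 90% of the run is already a small fraction of its peak, which is worth keeping in mind when reading the tail of the eval curve.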

Eval curve (held-out subset)

The full power-law decay across the run:

| Step | % of run | eval/loss |
|---|---|---|
| 0 | 0% | 5.5491 |
| 1,560 | 5% | 0.7823 ← matched A3-preview's final score in 1/10th the steps |
| 6,240 | 20% | 0.6787 |
| 15,600 | 50% | 0.6248 |
| 23,400 | 75% | 0.6027 |
| 28,080 | 90% | 0.6007 |
| 31,219 | 100% | 0.6006 |

The eval loss flattens in the cosine tail (0.6007 at 90% of the run, 0.6006 at 100%), so the projector has converged for this LR schedule.
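
To make the shape of the curve concrete, the checkpoints above can be turned into two simple quantities: the share of the total loss reduction reached at each checkpoint, and the average improvement per 1,000 steps within each interval. This is a small illustrative calculation over the reported eval values only; `fraction_of_drop` is a helper name introduced here.

```python
# (step, eval/loss) pairs copied from the eval curve above
EVAL_CURVE = [
    (0, 5.5491),
    (1_560, 0.7823),
    (6_240, 0.6787),
    (15_600, 0.6248),
    (23_400, 0.6027),
    (28_080, 0.6007),
    (31_219, 0.6006),
]

start_loss = EVAL_CURVE[0][1]
final_loss = EVAL_CURVE[-1][1]
total_drop = start_loss - final_loss


def fraction_of_drop(loss: float) -> float:
    """Share of the total start-to-final loss reduction reached at this loss value."""
    return (start_loss - loss) / total_drop


for (prev_step, prev_loss), (step, loss) in zip(EVAL_CURVE, EVAL_CURVE[1:]):
    # Average improvement per 1,000 optimizer steps over this interval
    per_1k = (prev_loss - loss) / (step - prev_step) * 1000
    print(
        f"step {step:>6}: loss {loss:.4f}, "
        f"{fraction_of_drop(loss):6.1%} of total drop, "
        f"{per_1k:.5f} loss per 1k steps"
    )
```

About 96% of the total reduction is already in place at the 5% checkpoint, and the final interval improves the loss by only 0.0001, which is the flattening described above.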

Hold-out comparison vs A3-preview

A3 vs A3-preview on the same 4 held-out images (2 BLIP3o + 2 OOD Japanese photos, deterministic generation, max_new_tokens=150). A3 wins decisively on 3/4, ties on 1.

1. BLIP3o — abandoned lighthouse


Ground truth	"weathered, abandoned lighthouse... peeling paint and rust"
A3-preview	"old, weathered structure... single-story with flat roof and ladder"
A3	"old, weathered lighthouse... peeling paint and visible rust"

2. BLIP3o — historic Indian fort wall


Ground truth	"fort wall with arrow slits... India"
A3-preview	"weathered stone wall with rectangular openings... India"
A3	"historic fort wall... reddish-brown stone... for defensive purposes... blue roofs"

3. OOD — bar scene with cocktails


Ground truth	"Charles Vanot Curaçao Bleu liqueur, Panda gin, treble clef stirrer"
A3-preview	'"Charles Vian"... blue bottle' (missed Panda entirely)
A3	'Charles Vannier, Panda, musical note stirrer'

4. OOD — Gundam statue at LaLaport


Ground truth	"RX-93ff Nu Gundam... LaLaport Fukuoka... LED lights"
A3-preview	"Gundam statue... Lalaport mall... illuminated by spotlights"
A3	'Gundam statue... "Lalaport"... life-sized model... overcast sky'

The pattern: A3 names the actual subject instead of describing it generically, and the brand/entity text reading is meaningfully sharper. The loss-curve delta (eval 0.77 → 0.60) translates directly into specific-entity recognition.

What works

Descriptive captioning across diverse domains — image-grounded, names specific objects, picks up brand text on OOD inputs, identifies real-world subjects (lighthouses, fort walls, statues, products). The 12B A2 decoder gives the prose noticeably more fluency than 2B-class VLMs at the same task.

What this is not

Not a fully instruction-tuned VLM. Visual instruction-following, VQA ("count the cats", "what color is the second item"), OCR-as-task, multi-turn visual reasoning, and tool calling on images are not trained here. Asked to do any of those, A3 will fall back to "describe."
No safety / refusal tuning on visual inputs.
No multi-image or video — single image per turn.

These are precisely what Artemis (Stage-2) is for.

What's next

Artemis: Stage-2 multimodal instruction tuning on top of A3. This is a full fine-tune (FFT) of the decoder and projector with a frozen ViT, using heavy text rehearsal (A2 tool calling + i-DPO identity + reasoning data) to protect the underlying A2 capabilities through the visual instruction phase. Artemis is the named flagship multimodal release.

Install

pip install 'artemis-vlm @ git+https://github.com/Schneewolf-Labs/Artemis.git@v0.1.0'

The `artemis-vlm` package contains the model class, processor, and data collator. On import, it registers `artemis_vlm` with Hugging Face `AutoConfig` and `AutoModelForCausalLM` so `from_pretrained()` resolves without `trust_remote_code`.

Usage

import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModelForCausalLM
import artemis_vlm  # registers ArtemisVLM with AutoConfig / AutoModel

# Load the model in bfloat16 and move it to the GPU
model = AutoModelForCausalLM.from_pretrained(
    "schneewolflabs/A3", dtype=torch.bfloat16,
).to("cuda").eval()

# Build the processor from the tokenizer and the vision tower's config
tok = AutoTokenizer.from_pretrained("schneewolflabs/A3")
processor = artemis_vlm.ArtemisVLMProcessor(
    tokenizer=tok, vision_config=model.visual.config,
    min_pixels=32 * 32, max_pixels=512 * 512,
)

# Qwen3 chat-template multimodal message
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in detail."},
]}]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Preprocess the prompt and image, then generate greedily
image = Image.open("your_image.jpg")
batch = processor(text=text, images=[image], return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=200, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt
print(tok.decode(out[0][batch["input_ids"].shape[1]:], skip_special_tokens=True))

Architecture notes (Path B — composition not modification)

A3 grafts a Qwen3-VL vision tower onto an unmodified A2 text decoder via a learned 2-layer MLP projector. The decoder's vocabulary, weights, chat template, reasoning, tool calling, and identity are byte-identical to A2 text-only — the multimodal addition cannot regress text capability because the text computation path doesn't change.

Vision tokens enter through A2's repurposed reserved-token layout (<|image_pad|> = token id 22 in the A-series Tekken vocab — see the A1 release notes for the full token-id allocation across <think>, <tool_call>, vision, etc.). The Qwen tokenizer is never in the picture; the projector bridges hidden spaces, not token spaces.
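
The idea of bridging hidden spaces rather than token spaces can be illustrated with a toy merge step. The sketch below follows the common LLaVA-style approach, in which the embedding rows at `<|image_pad|>` positions are overwritten with projected visual features before the decoder runs. It is an assumption about the general technique, not a copy of the Artemis implementation; `merge_visual_features`, the other token ids, and the tiny hidden size are hypothetical.

```python
IMAGE_PAD_ID = 22  # <|image_pad|> token id, as stated above


def merge_visual_features(input_ids, text_embeddings, visual_features):
    """Return a copy of text_embeddings with <|image_pad|> rows replaced, in order,
    by the projected visual features.

    input_ids:        list[int], one id per sequence position
    text_embeddings:  list[list[float]], one hidden vector per position
    visual_features:  list[list[float]], one projected vector per image token
    """
    pad_positions = [i for i, token in enumerate(input_ids) if token == IMAGE_PAD_ID]
    if len(pad_positions) != len(visual_features):
        raise ValueError(
            f"{len(pad_positions)} image placeholders but "
            f"{len(visual_features)} visual feature vectors"
        )
    merged = [list(row) for row in text_embeddings]  # copy, leave the input untouched
    for position, feature in zip(pad_positions, visual_features):
        merged[position] = list(feature)
    return merged


# Toy example: hidden size 2; ids other than 22 are made up for illustration
input_ids = [1, 22, 22, 57, 2]
text_embeddings = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.3, 0.3], [0.4, 0.4]]
visual_features = [[9.0, 9.1], [8.0, 8.1]]  # stand-ins for projector outputs

merged = merge_visual_features(input_ids, text_embeddings, visual_features)
print(merged)
```

Because only placeholder positions are overwritten, every text position reaches the decoder with its original embedding, which matches the claim that the text computation path is unchanged.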

See the Artemis repo README for the full architectural breakdown.

License

Apache 2.0. Same as A1, A2, A3-preview, and the underlying Qwen3-VL vision tower and BLIP3o-Pretrain-Long-Caption corpus.

Acknowledgements

BLIP3o team for the Pretrain-Long-Caption corpus
Qwen team for the Qwen3-VL vision encoder
LLaVA project for the architectural template
Schneewolf Labs Merlina + Artemis for the training and architecture infrastructure

— Schneewolf Labs · Project Artemis · Stage-1

Safetensors model size: 13B params. Tensor type: BF16.
