A3
Project Artemis, Stage-1. A3 is the projector-aligned successor to A3-preview: same architecture, 40× more training data, ~10× more optimizer steps, and dramatically sharper image grounding. A3 produces detailed descriptive captions on diverse images. It is not an instruction-tuned VLM: VQA, OCR-as-task, multi-turn visual reasoning, and tool calling are the job of Artemis (Stage-2), which builds on A3.
What this is
The same LLaVA-style graft as A3-preview:
| Component | Source | Role |
|---|---|---|
| Vision tower | Qwen/Qwen3-VL-2B-Instruct (ViT, ~600M params) | Image → visual feature tokens |
| Projector | Trained 2-layer MLP, ~45M params | Visual hidden → text hidden bridge |
| Language model | schneewolflabs/A2 (~12B params) | Unchanged decoder |
Only the projector is trained. The vision tower and decoder are frozen exactly as published, so A2's reasoning, tool calling, identity, and Qwen3 chat template behavior are byte-identical to A2 text-only by construction.
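To put the "only the projector is trained" claim in numbers, here is a minimal sketch that computes the trainable share of the full model. It uses only the approximate parameter counts quoted in the component table; the dictionary and helper names are illustrative and are not part of the `artemis-vlm` package.

```python
# Approximate parameter counts quoted in the component table (not exact values).
PARAMS = {
    "vision_tower": 600e6,    # frozen
    "projector": 45e6,        # the only trained component
    "language_model": 12e9,   # frozen
}
TRAINABLE = {"projector"}


def trainable_fraction(params, trainable):
    """Share of all parameters that receive gradient updates."""
    total = sum(params.values())
    trained = sum(count for name, count in params.items() if name in trainable)
    return trained / total


frac = trainable_fraction(PARAMS, TRAINABLE)
print(f"total     ~ {sum(PARAMS.values()) / 1e9:.3f}B parameters")
print(f"trainable ~ {frac:.2%} of the full model")
```

With these rounded counts the projector comes out at roughly 0.36% of the stack, which is why the frozen components dominate memory while the optimizer state stays small.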
Training details
| Setting | Value |
|---|---|
| Corpus | BLIP3o/BLIP3o-Pretrain-Long-Caption (1,000,000 streamed samples) |
| Optimizer | AdamW (fp32 moments), lr 1e-3 cosine to 0, 5% warmup |
| Effective batch | 32 (bs=2 × grad_accum=16) |
| Steps | 31,219 (1 epoch, ~0.5% held out for eval) |
| Precision | bfloat16 |
| Hardware | single NVIDIA GB10 (DGX Spark) |
| Wall clock | 130h 54m ≈ 5.45 days |
| Train loss | 5.55 → 0.65 |
| Eval loss | 5.55 → 0.60 on held-out BLIP3o |
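The figures in this table can be cross-checked with a few lines of arithmetic. The sketch below uses only values from the table; the throughput numbers it derives are rough, because the reported wall clock presumably also includes evaluation and checkpointing time (an assumption).

```python
# Values copied from the training table.
MICRO_BATCH = 2
GRAD_ACCUM = 16
STEPS = 31_219
WALL_CLOCK_HOURS, WALL_CLOCK_MINUTES = 130, 54

effective_batch = MICRO_BATCH * GRAD_ACCUM        # samples per optimizer step
samples_seen = STEPS * effective_batch            # training samples consumed
wall_clock_seconds = (WALL_CLOCK_HOURS * 60 + WALL_CLOCK_MINUTES) * 60
wall_clock_days = wall_clock_seconds / 86_400
seconds_per_step = wall_clock_seconds / STEPS
samples_per_second = samples_seen / wall_clock_seconds

print(f"effective batch : {effective_batch}")
print(f"samples seen    : {samples_seen:,}")
print(f"wall clock      : {wall_clock_days:.2f} days")
print(f"time per step   : {seconds_per_step:.1f} s")
print(f"throughput      : {samples_per_second:.2f} samples/s")
```

The run consumes 999,008 training samples, consistent with one pass over the 1,000,000-sample stream minus a small held-out slice.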
Eval curve (held-out subset)
The full power-law decay across the run:
| Step | % | eval/loss |
|---|---|---|
| 0 | 0% | 5.5491 |
| 1,560 | 5% | 0.7823 ← matched A3-preview's final score in 1/10th the steps |
| 6,240 | 20% | 0.6787 |
| 15,600 | 50% | 0.6248 |
| 23,400 | 75% | 0.6027 |
| 28,080 | 90% | 0.6007 |
| 31,219 | 100% | 0.6006 |
The eval loss flattens cleanly in the cosine tail: it moves by only about 0.002 between 75% and 100% of the run, so the projector has converged for this LR schedule.
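To see why the last checkpoints barely move, it helps to look at the learning rate at each eval step. The sketch below assumes a linear warmup from 0 over the first 5% of steps followed by cosine decay to 0. The peak LR, warmup share, and step count come from the training table, but the exact warmup shape and the function name `lr_at` are assumptions rather than the training code.

```python
import math

PEAK_LR = 1e-3
TOTAL_STEPS = 31_219
WARMUP_STEPS = int(0.05 * TOTAL_STEPS)  # 5% warmup


def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to 0 (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


# Steps from the eval table above.
EVAL_STEPS = [0, 1_560, 6_240, 15_600, 23_400, 28_080, 31_219]
for step in EVAL_STEPS:
    print(f"step {step:>6,}: lr = {lr_at(step):.2e}")
```

Under these assumptions the warmup ends at step 1,560, the first eval checkpoint after step 0, and by step 28,080 the learning rate has fallen below 5% of its peak, which matches the near-flat loss over the final 10% of the run.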
Hold-out comparison vs A3-preview
A3 vs A3-preview on the same 4 held-out images (2 BLIP3o + 2 OOD Japanese photos, deterministic generation, max_new_tokens=150). A3 wins decisively on 3/4, ties on 1.
1. BLIP3o — abandoned lighthouse
| Source | Caption |
|---|---|
| Ground truth | "weathered, abandoned lighthouse... peeling paint and rust" |
| A3-preview | "old, weathered structure... single-story with flat roof and ladder" |
| A3 | "old, weathered lighthouse... peeling paint and visible rust" |
2. BLIP3o — historic Indian fort wall
| Source | Caption |
|---|---|
| Ground truth | "fort wall with arrow slits... India" |
| A3-preview | "weathered stone wall with rectangular openings... India" |
| A3 | "historic fort wall... reddish-brown stone... for defensive purposes... blue roofs" |
3. OOD — bar scene with cocktails
| Source | Caption |
|---|---|
| Ground truth | "Charles Vanot Curaçao Bleu liqueur, Panda gin, treble clef stirrer" |
| A3-preview | '"Charles Vian"... blue bottle' (missed Panda entirely) |
| A3 | 'Charles Vannier, Panda, musical note stirrer' |
4. OOD — Gundam statue at LaLaport
| Source | Caption |
|---|---|
| Ground truth | "RX-93ff Nu Gundam... LaLaport Fukuoka... LED lights" |
| A3-preview | "Gundam statue... Lalaport mall... illuminated by spotlights" |
| A3 | 'Gundam statue... "Lalaport"... life-sized model... overcast sky' |
The pattern: A3 names the actual subject instead of describing it generically, and the brand/entity text reading is meaningfully sharper. The loss-curve delta (eval 0.77 → 0.60) translates directly into specific-entity recognition.
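One way to read the eval-loss delta is as per-token perplexity. The sketch below assumes the reported eval loss is a mean per-token cross-entropy in nats, which is the usual convention for causal-LM training but is not stated explicitly in this card. The variable names are illustrative.

```python
import math

# Final eval losses quoted above.
A3_PREVIEW_EVAL_LOSS = 0.77
A3_EVAL_LOSS = 0.60


def perplexity(loss_nats):
    """Per-token perplexity for a mean cross-entropy given in nats."""
    return math.exp(loss_nats)


ppl_preview = perplexity(A3_PREVIEW_EVAL_LOSS)
ppl_a3 = perplexity(A3_EVAL_LOSS)
relative_drop = 1.0 - ppl_a3 / ppl_preview

print(f"A3-preview perplexity ~ {ppl_preview:.2f}")
print(f"A3 perplexity         ~ {ppl_a3:.2f}")
print(f"relative reduction    ~ {relative_drop:.0%}")
```

Under that assumption, the caption perplexity falls from about 2.16 to about 1.82, a reduction of roughly 16%.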
What works
Descriptive captioning across diverse domains — image-grounded, names specific objects, picks up brand text on OOD inputs, identifies real-world subjects (lighthouses, fort walls, statues, products). The 12B A2 decoder gives the prose noticeably more fluency than 2B-class VLMs at the same task.
What this is not
- Not a fully instruction-tuned VLM. Visual instruction-following, VQA ("count the cats", "what color is the second item"), OCR-as-task, multi-turn visual reasoning, and tool calling on images are not trained here. Asked to do any of those, A3 will fall back to "describe."
- No safety / refusal tuning on visual inputs.
- No multi-image or video — single image per turn.
These are precisely what Artemis (Stage-2) is for.
What's next
- Artemis: Stage-2 multimodal instruction full fine-tune (FFT) on top of A3. It fully fine-tunes the decoder and projector with the ViT frozen, using heavy text rehearsal (A2 tool calling + i-DPO identity + reasoning data) to protect the underlying A2 capabilities through the visual instruction phase. Artemis is the named flagship multimodal release.
Install
```shell
pip install 'artemis-vlm @ git+https://github.com/Schneewolf-Labs/Artemis.git@v0.1.0'
```
The artemis-vlm package contains the model class, processor, and data collator. On import, it registers artemis_vlm with Hugging Face AutoConfig and AutoModelForCausalLM, so from_pretrained() resolves without trust_remote_code.
Usage
```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModelForCausalLM

import artemis_vlm  # registers ArtemisVLM with AutoConfig / AutoModel

model = AutoModelForCausalLM.from_pretrained(
    "schneewolflabs/A3", dtype=torch.bfloat16,
).to("cuda").eval()
tok = AutoTokenizer.from_pretrained("schneewolflabs/A3")
processor = artemis_vlm.ArtemisVLMProcessor(
    tokenizer=tok, vision_config=model.visual.config,
    min_pixels=32 * 32, max_pixels=512 * 512,
)

# Qwen3 chat-template multimodal message
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in detail."},
]}]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

image = Image.open("your_image.jpg")
batch = processor(text=text, images=[image], return_tensors="pt").to("cuda")

# Greedy decoding, no gradients needed at inference time
with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=200, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt
print(tok.decode(out[0][batch["input_ids"].shape[1]:], skip_special_tokens=True))
```
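The `min_pixels` / `max_pixels` arguments suggest a Qwen-style pixel budget for input images. The sketch below shows how such a budget is commonly enforced: round each side to a multiple of an alignment factor, then rescale if the area falls outside the budget. Whether `ArtemisVLMProcessor` follows exactly this scheme is an assumption. The function name `fit_to_pixel_budget` and the 32-pixel alignment factor are hypothetical and are not taken from the package.

```python
import math


def fit_to_pixel_budget(height, width, factor=32,
                        min_pixels=32 * 32, max_pixels=512 * 512):
    """Return (height, width) aligned to `factor` and within the pixel budget.

    Simplified sketch of Qwen-style resizing; `factor` is an assumed value.
    """
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        scale = math.sqrt((height * width) / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif h * w < min_pixels:
        # Too small: grow both sides by the same ratio, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w


h, w = fit_to_pixel_budget(1080, 1920)
print(h, w, h * w <= 512 * 512)
```

Under these assumptions, a 1920×1080 photo is reduced to 672×384 before it reaches the vision tower, so fine print in large images may be lost at the `max_pixels` used in the usage example.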
Architecture notes (Path B — composition not modification)
A3 grafts a Qwen3-VL vision tower onto an unmodified A2 text decoder via a learned 2-layer MLP projector. The decoder's vocabulary, weights, chat template, reasoning, tool calling, and identity are byte-identical to A2 text-only — the multimodal addition cannot regress text capability because the text computation path doesn't change.
Vision tokens enter through A2's repurposed reserved-token layout
(<|image_pad|> = token id 22 in the A-series Tekken vocab — see the A1
release notes for the full token-id allocation across <think>,
<tool_call>, vision, etc.). The Qwen tokenizer is never in the picture;
the projector bridges hidden spaces, not token spaces.
See the Artemis repo README
for the full architectural breakdown.
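As an illustration of how projected vision features can enter the decoder through `<|image_pad|>` slots, here is a minimal LLaVA-style sketch in plain Python. It assumes one projected feature vector per placeholder token. The function name, toy token ids, and toy embeddings are hypothetical; only the placeholder id 22 comes from the notes above, and the actual Artemis implementation may differ.

```python
IMAGE_PAD_ID = 22  # <|image_pad|> in the A-series Tekken vocab, per the notes above


def splice_visual_features(input_ids, text_embeds, visual_embeds):
    """Swap the embedding at each <|image_pad|> position for a projected visual feature.

    text_embeds: one vector per token, from the decoder's embedding table.
    visual_embeds: one vector per placeholder, already projected to text hidden size.
    """
    n_slots = sum(1 for tok in input_ids if tok == IMAGE_PAD_ID)
    if n_slots != len(visual_embeds):
        raise ValueError(f"{n_slots} image slots but {len(visual_embeds)} visual features")
    visual_iter = iter(visual_embeds)
    return [
        next(visual_iter) if tok == IMAGE_PAD_ID else emb
        for tok, emb in zip(input_ids, text_embeds)
    ]


# Toy example with hidden size 2: two image slots between two text tokens.
input_ids = [1, IMAGE_PAD_ID, IMAGE_PAD_ID, 57]
text_embeds = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.4, 0.4]]
visual_embeds = [[9.0, 9.0], [8.0, 8.0]]
merged = splice_visual_features(input_ids, text_embeds, visual_embeds)
print(merged)
```

Text positions keep their original embeddings untouched. This is the sense in which the projector bridges hidden spaces and not token spaces: the decoder only ever sees vectors of its own hidden size.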
License
Apache 2.0. Same as A1, A2, A3-preview, and the underlying Qwen3-VL vision tower and BLIP3o-Pretrain-Long-Caption corpus.
Acknowledgements
- BLIP3o team for the Pretrain-Long-Caption corpus
- Qwen team for the Qwen3-VL vision encoder
- LLaVA project for the architectural template
- Schneewolf Labs Merlina + Artemis for the training and architecture infrastructure
— Schneewolf Labs · Project Artemis · Stage-1