OLMo-3-7B-Think-VAC
A structurally compressed version of OLMo-3-7B-Think using Variable Allocation Compression (VAC).
This model has the same architecture as OLMo-3-7B-Think but with each linear layer factorized into two smaller matrices, reducing storage by 1.8x and inference FLOPs by ~1.8x.
| Property | Value |
|---|---|
| Base model | allenai/OLMo-3-7B-Think |
| Compression method | VAC (Variable Allocation Compression) |
| Compression ratio | 1.8x |
| Download size | ~8.9 GB (vs 14.6 GB original) |
| VRAM (bf16) | ~8.9 GB (fits 12 GB GPUs) |
| VRAM (INT8) | ~4.5 GB (fits 8 GB GPUs) |
| Inference speed | ~1.8x faster than original |
| C4 PPL | 26.97 (original: 21.05) |
Usage
Requires transformers and trust_remote_code=True:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# bf16: requires a 12+ GB GPU (RTX 3080, 4070, A10G, etc.)
model = AutoModelForCausalLM.from_pretrained(
    "asystemoffields/OLMo-3-7B-Think-VAC",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# INT8: requires an 8+ GB GPU (RTX 3060, 4060, etc.)
# model = AutoModelForCausalLM.from_pretrained(
#     "asystemoffields/OLMo-3-7B-Think-VAC",
#     trust_remote_code=True,
#     load_in_8bit=True,
# )

# The tokenizer is loaded from the base model repo.
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-3-7B-Think")

messages = [{"role": "user", "content": "What is 38 + 47? Show your work."}]

# return_dict=True yields input_ids and attention_mask together,
# so both can be unpacked into generate().
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
)

# Keep special tokens so the <think> block stays visible.
print(tokenizer.decode(output[0], skip_special_tokens=False))
The model generates <think>...reasoning...</think> before its answer, just like the original OLMo-3-7B-Think. Set max_new_tokens to at least 1024 for complete responses (the thinking block can be long).
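If only the final answer is needed, the reasoning block can be separated from it after decoding. The helper below is a minimal sketch, not part of this repo: the name split_thinking is hypothetical, and it assumes the decoded text contains at most one <think>...</think> pair. Other special tokens left in the text (for example an end-of-turn marker) are not stripped.

```python
def split_thinking(text, open_tag="<think>", close_tag="</think>"):
    """Return (reasoning, answer) from decoded model output.

    Hypothetical helper. If the closing tag is missing, for example
    because generation stopped at max_new_tokens, everything after the
    opening tag is returned as reasoning and the answer is empty.
    """
    # Drop anything before the opening tag (prompt text, if present).
    body = text.split(open_tag, 1)[-1]
    # partition() returns an empty tail when close_tag is absent.
    reasoning, _, answer = body.partition(close_tag)
    return reasoning.strip(), answer.strip()


reasoning, answer = split_thinking(
    "<think>38 + 47 = 85</think>The answer is 85."
)
print(answer)
```

An empty answer is a useful signal that max_new_tokens was too small for the thinking block to finish.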
What is VAC?
Variable Allocation Compression replaces each dense linear layer with two smaller factor matrices (down and up), where W ≈ up @ down. The rank of each factorization is allocated per-matrix using Fisher information and a knapsack solver: important matrices get more rank, redundant ones get less.
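To see why factorization saves both storage and compute, it helps to count weights. A dense layer with shape d_out x d_in stores d_in * d_out weights, while the factor pair stores rank * (d_in + d_out), and the per-token matmul FLOPs scale the same way. The sketch below works this out for an illustrative 4096 x 4096 matrix; that size and the function names are assumptions for the example, not values read from this model's config.

```python
def dense_params(d_in, d_out):
    """Weights in a dense d_out x d_in linear layer (bias ignored)."""
    return d_in * d_out


def factored_params(d_in, d_out, rank):
    """Weights in the factor pair: down is rank x d_in, up is d_out x rank."""
    return rank * (d_in + d_out)


def max_rank_for_ratio(d_in, d_out, ratio):
    """Largest rank whose factor pair is at least `ratio` times smaller."""
    return int(d_in * d_out / (ratio * (d_in + d_out)))


d = 4096  # illustrative size, an assumption for this example
rank = max_rank_for_ratio(d, d, 1.8)
achieved = dense_params(d, d) / factored_params(d, d, rank)
print(rank, round(achieved, 3))
```

Factorization only pays off when the rank is below d_in * d_out / (d_in + d_out); above that break-even point the two factors hold more weights than the dense matrix.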
The compression strategy was discovered by evolutionary search over compression order, Fisher scaling exponent, and per-component allocation. Key findings:
- Middle-out compression order: compress easy middle layers first
- Cube-root Fisher exponent: gentler than sqrt, avoids over-trusting the Fisher approximation
- Attention-heavy allocation: attention tolerates 4x compression, while the MLP is far more sensitive
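The first two findings can be illustrated with a short sketch. It is not the repo's implementation: the function names are hypothetical, the exact middle-out ordering is an assumption, and a simple proportional split stands in for the knapsack solver. The real method may also apply the exponent elsewhere, for example inside the Fisher-weighted SVD rather than in the rank budget.

```python
def middle_out_order(num_layers):
    """Layer indices starting at the middle and alternating outward.

    Assumed ordering for illustration; the evolved order may differ.
    """
    mid = (num_layers - 1) // 2
    order = [mid]
    for step in range(1, num_layers):
        for idx in (mid + step, mid - step):
            if 0 <= idx < num_layers:
                order.append(idx)
    return order


def allocate_ranks(fisher_scores, total_rank, exponent=1 / 3):
    """Split a rank budget in proportion to fisher ** exponent.

    A proportional split, NOT a knapsack solver. A smaller exponent
    flattens the scores, so no single matrix dominates the budget.
    """
    weights = [score ** exponent for score in fisher_scores]
    total = sum(weights)
    return [max(1, round(total_rank * w / total)) for w in weights]


print(middle_out_order(6))
print(allocate_ranks([1.0, 8.0, 27.0], 600))                # cube root
print(allocate_ranks([1.0, 8.0, 27.0], 600, exponent=0.5))  # square root
```

With the cube-root exponent the three example matrices receive ranks in a 1:2:3 proportion, whereas the square root skews the split further toward the highest-scoring matrix.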
How It Differs from Quantization
| | Quantization (GPTQ, AWQ) | VAC |
|---|---|---|
| What it reduces | Bits per weight | Number of weights |
| FLOPs | Same as original | ~1.8x fewer |
| Inference speed | Same (or slight bandwidth win) | ~1.8x faster |
| Stacks with quant? | N/A | Yes (INT8 on factored weights) |
VAC and quantization are orthogonal. You can quantize the factored matrices for additional savings.
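The stacking effect is plain arithmetic on the figures in the table above. The sketch below derives an approximate weight count from the ~8.9 GB bf16 size and re-prices it at one byte per weight. It assumes decimal gigabytes and counts weights only; activations, the KV cache, and any quantization metadata are ignored.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in decimal GB."""
    return num_params * bytes_per_param / 1e9


bf16_gb = 8.9                       # from the table above
num_params = bf16_gb * 1e9 / 2      # bf16 stores 2 bytes per weight
int8_gb = weight_memory_gb(num_params, 1)  # INT8 stores 1 byte per weight

print(f"~{num_params / 1e9:.2f}B weights, ~{int8_gb:.2f} GB at INT8")
```

The result lines up with the ~4.5 GB INT8 figure quoted in the table.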
Limitations
- No GGUF/Ollama/LM Studio support. The factorized layer format is not supported by llama.cpp. This model runs via HuggingFace Transformers only.
- Requires trust_remote_code=True: the factorized layer class is defined in modeling_pmre_olmo.py, shipped with this repo.
- ~16 GB system RAM required for loading (model loads to CPU first, then moves to GPU).
- ~6 PPL gap from the original on C4 (26.97 vs 21.05). For interactive use this is generally imperceptible, but it may be measurable on precise benchmarks.
Method Details
- Compression: Sequential Fisher-weighted SVD with evolved middle-out order and cube-root exponent
- Recovery: Knowledge distillation on DOLMA (OLMo's training data) with 20% Think-completion interleave
- Post-training: Dolci-Think-SFT replay (instruction tuning with <think> traces)
- Attention tuning: Differential learning rate KD (attention at 10x higher LR than MLP) to recover routing quality
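A differential learning rate like the one in the attention-tuning step is usually expressed as optimizer parameter groups. The sketch below is an illustration rather than the training code used here: build_param_groups is a hypothetical helper, the "self_attn" name marker is an assumption about how attention modules are named, and every non-attention parameter (not only the MLP) falls into the base-rate group.

```python
def build_param_groups(named_params, base_lr, attn_multiplier=10.0,
                       attn_marker="self_attn"):
    """Split parameters into attention and non-attention groups.

    named_params: iterable of (name, parameter) pairs, for example
    model.named_parameters(). attn_marker is an assumed substring of
    attention parameter names.
    """
    attn, other = [], []
    for name, param in named_params:
        (attn if attn_marker in name else other).append(param)
    return [
        {"params": attn, "lr": base_lr * attn_multiplier},
        {"params": other, "lr": base_lr},
    ]


# Stand-in names and values; real use would pass model.named_parameters().
groups = build_param_groups(
    [
        ("model.layers.0.self_attn.q_proj.weight", "attn_weight"),
        ("model.layers.0.mlp.up_proj.weight", "mlp_weight"),
    ],
    base_lr=1e-5,  # illustrative value, not taken from the source
)
print([(len(g["params"]), g["lr"]) for g in groups])
```

The returned list has the shape that PyTorch optimizers accept as per-parameter options, so it could be passed directly to an optimizer such as torch.optim.AdamW.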
Full technical details: github.com/asystemoffields/v-a-c
Acknowledgments
- Allen AI for OLMo-3-7B-Think and their commitment to open science: full training data (DOLMA), post-training data (Dolci), evaluation infrastructure (OLMES), and every intermediate checkpoint published openly.
- Method: VAC (Variable Allocation Compression)
License
Apache 2.0 (same as the base model).