Libraries

How to use Asystemoffields/OLMo-3-7B-Think-VAC with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Asystemoffields/OLMo-3-7B-Think-VAC", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Asystemoffields/OLMo-3-7B-Think-VAC", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Asystemoffields/OLMo-3-7B-Think-VAC", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Asystemoffields/OLMo-3-7B-Think-VAC with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Asystemoffields/OLMo-3-7B-Think-VAC"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Asystemoffields/OLMo-3-7B-Think-VAC",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/Asystemoffields/OLMo-3-7B-Think-VAC

SGLang

How to use Asystemoffields/OLMo-3-7B-Think-VAC with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Asystemoffields/OLMo-3-7B-Think-VAC" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Asystemoffields/OLMo-3-7B-Think-VAC",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Asystemoffields/OLMo-3-7B-Think-VAC" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Asystemoffields/OLMo-3-7B-Think-VAC",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use Asystemoffields/OLMo-3-7B-Think-VAC with Docker Model Runner:
```
docker model run hf.co/Asystemoffields/OLMo-3-7B-Think-VAC
```

OLMo-3-7B-Think-VAC

A structurally compressed version of OLMo-3-7B-Think using Variable Allocation Compression (VAC).

This model has the same architecture as OLMo-3-7B-Think, but each linear layer is factorized into two smaller matrices, which reduces both storage and inference FLOPs by roughly 1.8x.

Property	Value
Base model	allenai/OLMo-3-7B-Think
Compression method	VAC (Variable Allocation Compression)
Compression ratio	1.8x
Download size	~8.9 GB (vs 14.6 GB original)
VRAM (bf16)	~8.9 GB (fits 12 GB GPUs)
VRAM (INT8)	~4.5 GB (fits 8 GB GPUs)
Inference speed	~1.8x faster than original
C4 PPL	26.97 (original: 21.05)

Usage

Requires transformers and trust_remote_code=True:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# bf16: needs a GPU with 12+ GB of VRAM (RTX 3080 12GB, 4070, A10G, etc.)
model = AutoModelForCausalLM.from_pretrained(
    "Asystemoffields/OLMo-3-7B-Think-VAC",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# INT8: needs a GPU with 8+ GB of VRAM (RTX 3060, 4060, etc.) and the bitsandbytes package
# from transformers import BitsAndBytesConfig
# model = AutoModelForCausalLM.from_pretrained(
#     "Asystemoffields/OLMo-3-7B-Think-VAC",
#     trust_remote_code=True,
#     quantization_config=BitsAndBytesConfig(load_in_8bit=True),
#     device_map="auto",
# )

# The tokenizer is loaded from the base model
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-3-7B-Think")

messages = [{"role": "user", "content": "What is 38 + 47? Show your work."}]
# return_dict=True returns input_ids and attention_mask together
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
)
# Keep special tokens so the <think>...</think> block stays visible
print(tokenizer.decode(output[0], skip_special_tokens=False))

The model generates <think>...reasoning...</think> before its answer, just like the original OLMo-3-7B-Think. Set max_new_tokens to at least 1024 for complete responses (the thinking block can be long).
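If the reasoning needs to be separated from the final answer, a small helper along these lines may be enough. It is a minimal sketch that assumes the `<think>` and `</think>` tags appear literally in the decoded text; the name `split_reasoning` is hypothetical and not part of this repo.

```python
import re

# Non-greedy match so only the first think block is captured; DOTALL lets it span lines.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer); reasoning is None when no closed think block exists."""
    match = THINK_PATTERN.search(text)
    if match is None:
        # Typical cause: generation stopped before </think> (max_new_tokens too small).
        return None, text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


reasoning, answer = split_reasoning("<think>38 + 47 = 85</think>The answer is 85.")
print(reasoning)
print(answer)
```

A `None` reasoning value is a useful signal that the thinking block was cut off and `max_new_tokens` should be raised. Any end-of-turn special tokens left in the decoded text will remain in the answer string and may need stripping separately.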

What is VAC?

Variable Allocation Compression replaces each dense linear layer with two smaller factor matrices (down and up), where W ≈ up @ down. The rank of each factorization is allocated per-matrix using Fisher information and a knapsack solver — important matrices get more rank, redundant ones get less.
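The factorization W ≈ up @ down can be illustrated with a plain truncated SVD. This is only a sketch of the low-rank idea on a toy matrix: it does not include the Fisher weighting or per-matrix rank allocation that VAC uses, and the function name `factorize` and the matrix sizes are assumptions made for the example.

```python
import numpy as np


def factorize(weight, rank):
    """Truncated SVD: return (up, down) such that up @ down approximates weight."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    up = u[:, :rank] * s[:rank]  # (out_features, rank), singular values folded in
    down = vt[:rank, :]          # (rank, in_features)
    return up, down


rng = np.random.default_rng(0)
out_features, in_features = 64, 48
# Toy weight built to have exact rank 16, so it is compressible by construction.
weight = rng.standard_normal((out_features, 16)) @ rng.standard_normal((16, in_features))

up, down = factorize(weight, rank=8)
relative_error = np.linalg.norm(weight - up @ down) / np.linalg.norm(weight)

dense_params = weight.size             # 64 * 48
factored_params = up.size + down.size  # 8 * (64 + 48)
print(f"relative error at rank 8: {relative_error:.3f}")
print(f"parameters: {dense_params} dense vs {factored_params} factored")
```

At inference time the product is never rebuilt: the layer computes `up @ (down @ x)`, which is where the reduction in multiply-adds comes from.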

The compression strategy was discovered by evolutionary search over compression order, Fisher scaling exponent, and per-component allocation. Key findings:

Middle-out compression order: compress easy middle layers first
Cube-root Fisher exponent: gentler than sqrt, avoids over-trusting the Fisher approximation
Attention-heavy allocation: attention tolerates 4x compression, while the MLP layers are far more sensitive
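To make the allocation idea concrete, here is a toy greedy budgeted allocator. It is not the knapsack solver used for this model: the scoring rule (Fisher score raised to the 1/3 power, times the squared singular value, divided by the parameter cost of one extra rank), the function name `allocate_ranks`, and all numbers are assumptions chosen for illustration.

```python
import heapq


def allocate_ranks(matrices, param_budget, fisher_exponent=1 / 3):
    """Greedily hand out rank units by importance per parameter until the budget is spent.

    matrices maps a name to {"shape": (out, in), "fisher": float,
    "energies": squared singular values in descending order}.
    """
    ranks = {name: 0 for name in matrices}

    def gain(name, index):
        info = matrices[name]
        cost = sum(info["shape"])  # one extra rank adds out + in parameters
        return (info["fisher"] ** fisher_exponent) * info["energies"][index] / cost

    # Max-heap via negated gains; each entry is the next rank unit of one matrix.
    heap = [(-gain(name, 0), name) for name, info in matrices.items() if info["energies"]]
    heapq.heapify(heap)

    spent = 0
    while heap:
        _, name = heapq.heappop(heap)
        cost = sum(matrices[name]["shape"])
        if spent + cost > param_budget:
            continue  # cannot afford another rank for this matrix; try the others
        ranks[name] += 1
        spent += cost
        if ranks[name] < len(matrices[name]["energies"]):
            heapq.heappush(heap, (-gain(name, ranks[name]), name))
    return ranks, spent


toy = {
    "attn": {"shape": (4, 4), "fisher": 1.0, "energies": [9.0, 4.0, 1.0, 0.25]},
    "mlp": {"shape": (8, 4), "fisher": 8.0, "energies": [9.0, 4.0, 1.0, 0.25]},
}
ranks, spent = allocate_ranks(toy, param_budget=40)
print(ranks, spent)
```

The exponent controls how strongly the Fisher score tilts the allocation: with 1/3, a matrix whose Fisher score is 8x larger only gets a 2x boost, which is the "gentler than sqrt" behavior described above.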

How It Differs from Quantization

	Quantization (GPTQ, AWQ)	VAC
What it reduces	Bits per weight	Number of weights
FLOPs	Same as original	~1.8x fewer
Inference speed	Same (or slight bandwidth win)	~1.8x faster
Stacks with quant?	N/A	Yes (INT8 on factored weights)

VAC and quantization are orthogonal. You can quantize the factored matrices for additional savings.
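The difference in what gets reduced can be checked with simple counting. The sketch below compares a dense layer with its factored form; the 4096 x 4096 layer size and the rank of 1137 are illustrative assumptions, not values taken from this model.

```python
def dense_cost(out_features, in_features):
    """Weights (and multiply-adds per token) of a dense linear layer."""
    return out_features * in_features


def factored_cost(out_features, in_features, rank):
    """Weights (and multiply-adds per token) of the up @ down factorization."""
    return rank * (out_features + in_features)


def break_even_rank(out_features, in_features):
    """Rank at which the factored form costs exactly as much as the dense one."""
    return (out_features * in_features) / (out_features + in_features)


out_features = in_features = 4096  # illustrative layer size
rank = 1137                        # illustrative rank, chosen to land near 1.8x

dense = dense_cost(out_features, in_features)
factored = factored_cost(out_features, in_features, rank)
print(f"dense: {dense}, factored: {factored}, ratio: {dense / factored:.2f}x")
print(f"break-even rank: {break_even_rank(out_features, in_features):.0f}")

# Quantization changes bytes per weight, not the weight count, so the two multiply.
bytes_bf16 = factored * 2
bytes_int8 = factored * 1
print(f"factored layer: {bytes_bf16} bytes in bf16, {bytes_int8} bytes in INT8")
```

Factorization only saves work when the rank is below the break-even value; above it the two factors cost more than the original matrix.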

Limitations

No GGUF/Ollama/LM Studio support. The factorized layer format is not supported by llama.cpp. This model runs via HuggingFace Transformers only.
Requires trust_remote_code=True — the factorized layer class is defined in modeling_pmre_olmo.py shipped with this repo.
~16 GB system RAM required for loading (model loads to CPU first, then moves to GPU).
~6 PPL gap from the original on C4 (26.97 vs. 21.05). In interactive use this is generally imperceptible, but it may be measurable on precise benchmarks.
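The perplexity gap can also be expressed in relative terms and as extra loss per token. The short calculation below uses the two C4 values from the properties table and assumes the usual definition of perplexity as the exponential of the mean per-token cross-entropy (natural log).

```python
import math

ppl_original = 21.05  # C4 PPL of OLMo-3-7B-Think, from the table above
ppl_vac = 26.97       # C4 PPL of this model, from the table above

absolute_gap = ppl_vac - ppl_original
relative_increase = ppl_vac / ppl_original - 1.0
# Under PPL = exp(loss), the loss difference is the difference of the logs.
extra_nats_per_token = math.log(ppl_vac) - math.log(ppl_original)

print(f"absolute gap: {absolute_gap:.2f} PPL")
print(f"relative increase: {relative_increase:.1%}")
print(f"extra cross-entropy: {extra_nats_per_token:.3f} nats per token")
```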

Method Details

Compression: Sequential Fisher-weighted SVD with evolved middle-out order and cube-root exponent
Recovery: Knowledge distillation on DOLMA (OLMo's training data) with 20% Think-completion interleave
Post-training: Dolci-Think-SFT replay (instruction tuning with <think> traces)
Attention tuning: Differential learning rate KD (attention at 10x higher LR than MLP) to recover routing quality
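The differential learning rate in the attention tuning step could be set up with per-parameter groups. This is a hedged sketch, not the training code from the repo: the helper name `build_param_groups`, the `"self_attn"` name marker, the base learning rate, and the choice to put all non-attention parameters at the MLP rate are assumptions.

```python
def build_param_groups(named_parameters, mlp_lr, attention_multiplier=10.0,
                       attention_marker="self_attn"):
    """Split parameters into an attention group and a group for everything else.

    named_parameters is any iterable of (name, parameter) pairs, for example
    model.named_parameters(). The attention group gets mlp_lr * attention_multiplier.
    """
    attention, other = [], []
    for name, param in named_parameters:
        (attention if attention_marker in name else other).append(param)
    return [
        {"params": attention, "lr": mlp_lr * attention_multiplier},
        {"params": other, "lr": mlp_lr},
    ]


# Stand-in (name, parameter) pairs; real code would pass model.named_parameters().
demo = [
    ("model.layers.0.self_attn.q_proj.weight", "q_proj"),
    ("model.layers.0.mlp.up_proj.weight", "up_proj"),
]
for group in build_param_groups(demo, mlp_lr=1e-5):
    print(group["lr"], group["params"])
```

The returned list of dictionaries has the per-parameter-group shape that PyTorch optimizers accept, so it can be passed directly as the first argument to an optimizer such as `torch.optim.AdamW`.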

Full technical details: github.com/asystemoffields/v-a-c

Acknowledgments

Allen AI for OLMo-3-7B-Think and their commitment to open science — full training data (DOLMA), post-training data (Dolci), evaluation infrastructure (OLMES), and every intermediate checkpoint published openly.
Method: VAC (Variable Allocation Compression)

License

Apache 2.0 (same as the base model).

Safetensors

Model size: 4B params
Tensor type: BF16
