ternary-models: VLMs, Multimodal & Audio
Collection
Ternary-quantized models for architectures GGUF can't handle. tritplane3 scheme. • 16 items
How to use AsadIsmail/SmolVLM2-2.2B-Instruct-ternary with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="AsadIsmail/SmolVLM2-2.2B-Instruct-ternary")

# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("AsadIsmail/SmolVLM2-2.2B-Instruct-ternary", dtype="auto")
```

How to use AsadIsmail/SmolVLM2-2.2B-Instruct-ternary with vLLM:
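The image-text-to-text pipeline accepts chat-style multimodal messages. A minimal sketch of that input format, following the Transformers multimodal chat convention (the image URL below is a placeholder, not a real asset):

```python
# Chat-style input for an image-text-to-text pipeline: each message has a
# role and a list of typed content parts (image parts plus text parts).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

# With a loaded pipeline, this would be passed as: pipe(text=messages)
```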
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "AsadIsmail/SmolVLM2-2.2B-Instruct-ternary"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "AsadIsmail/SmolVLM2-2.2B-Instruct-ternary",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
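The same completions request can be made from Python with only the standard library. This assumes the vLLM server from the step above is running on localhost:8000:

```python
import json
import urllib.request

# Same request body as the curl call above.
payload = {
    "model": "AsadIsmail/SmolVLM2-2.2B-Instruct-ternary",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment when the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```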
How to use AsadIsmail/SmolVLM2-2.2B-Instruct-ternary with SGLang:

```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "AsadIsmail/SmolVLM2-2.2B-Instruct-ternary" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "AsadIsmail/SmolVLM2-2.2B-Instruct-ternary",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'

# Alternatively, start the SGLang server in Docker:
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "AsadIsmail/SmolVLM2-2.2B-Instruct-ternary" \
  --host 0.0.0.0 \
  --port 30000
```
How to use AsadIsmail/SmolVLM2-2.2B-Instruct-ternary with Docker Model Runner:
```shell
docker model run hf.co/AsadIsmail/SmolVLM2-2.2B-Instruct-ternary
```
Ternary-quantized version of HuggingFaceTB/SmolVLM2-2.2B-Instruct using ternary-quant.
Compact VLM designed for edge deployment, now even smaller with ternary quantization.
| Property | Value |
|---|---|
| Base Model | HuggingFaceTB/SmolVLM2-2.2B-Instruct |
| Parameters | 2.2B |
| Architecture | VLM (image + text) |
| Quantization | tritplane3 (169 layers, 10.92 effective bits) |
| Vision Encoder | FP16 (preserved) |
| Compression | 1.47x |
| Avg Reconstruction Error | 0.1236 |
| License | Apache 2.0 |
| Method | Size | VLM Support |
|---|---|---|
| FP16 (original) | ~4.4 GB | Yes |
| Ternary tritplane3 | 1.8 GB | Yes |
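As a sanity check on the numbers above, the reported 1.47x compression appears to follow directly from the effective bit width of the quantized layers (16 bits in FP16 vs. 10.92 effective bits); this interpretation is an assumption, since the card does not spell out the accounting:

```python
# Per-layer compression implied by the effective bit width from the model card.
fp16_bits = 16
effective_bits = 10.92  # average over the 169 quantized layers

compression = fp16_bits / effective_bits
print(f"{compression:.2f}x")
```

Note that the on-disk ratio in the size table (~4.4 GB to 1.8 GB) is larger than this per-layer figure; the two measure different things.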
No GGUF alternative exists for SmolVLM2.
Validated during quantization (collapse score: 0.009, excellent):
| Test | Output |
|---|---|
| Image description (demo) | "A yellow circle with a diagonal line through it" (correct) |
| "What is machine learning?" | Correct, detailed explanation of ML, algorithms, training |
| "Explain gravity" | Accurate one-sentence explanation |
| Runtime | Min Memory | Hardware |
|---|---|---|
| cached (CPU) | ~4 GB RAM | Any |
| metal (Apple Silicon) | ~3 GB unified | M1+ |
| cached (CUDA) | ~3 GB VRAM | Any NVIDIA GPU |
Ideal for edge deployment: runs on devices with 4 GB of RAM.
```shell
pip install ternary-quant
```

```python
from ternary_quant.inference import load_ternary_model

# Load the quantized model and its processor; device="auto" picks CPU/CUDA/Metal
model, processor = load_ternary_model(
    "AsadIsmail/SmolVLM2-2.2B-Instruct-ternary",
    runtime_mode="cached", device="auto"
)

inputs = processor(text="Describe this image", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
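For intuition, here is a minimal sketch of absmean-style ternary quantization: each weight maps to {-1, 0, +1} with a per-tensor scale. This is a generic illustration only, not the actual tritplane3 format (whose bit-plane packing is defined in the ternary-quant library); the threshold value here is also an arbitrary choice:

```python
def ternary_quantize(weights, sparsity_threshold=0.5):
    """Quantize a list of floats to ternary codes {-1, 0, +1} plus a scale.

    Generic illustration; the real tritplane3 scheme differs.
    """
    # Per-tensor scale: mean absolute value of the weights.
    scale = sum(abs(w) for w in weights) / len(weights)
    # Small weights snap to zero; the rest keep only their sign.
    codes = [
        0 if abs(w) < sparsity_threshold * scale else (1 if w > 0 else -1)
        for w in weights
    ]
    return codes, scale


def ternary_dequantize(codes, scale):
    """Reconstruct approximate weights from ternary codes and the scale."""
    return [c * scale for c in codes]


codes, scale = ternary_quantize([0.9, -0.05, 0.4, -1.2])
approx = ternary_dequantize(codes, scale)
# codes is [1, 0, 1, -1]; each surviving weight is reconstructed as +/-scale
```

Each weight needs under 2 bits to store its code (three states), which is where schemes like this get their compression; the reconstruction error seen in the table above comes from collapsing all magnitudes to a single scale.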
Part of ternary-models.
GitHub: github.com/Asad-Ismail/ternary-models | Library: github.com/Asad-Ismail/ternary-quant