# Squeez-2B

*Squeeze out the juice, leave the pulp behind.*
LLM coding agents spend 80-95% of their context window on irrelevant tool output. Squeez filters it down to the lines that actually matter, compressing tool output by ~91% while keeping 86% of the relevant information.
## What is Squeez?
A tool output pruner for coding agents. When an agent runs a tool (pytest, grep, git log, npm build, kubectl, etc.), the output is often hundreds of lines but only a handful matter for the current task. Squeez sits between the tool and the agent's context window:
```
Tool output (500 lines) → Squeez → Relevant lines (30 lines) → Agent context
```
Existing context pruning tools (SWE-Pruner, Zilliz Semantic Highlight, Provence) are built for source code or document paragraphs. They don't handle the mixed, unstructured format of tool output (stack traces interleaved with passing tests, grep matches with context lines, build logs with timestamps).
This model is Qwen 3.5 2B fine-tuned to extract verbatim relevant lines from tool output given a task-specific query. It's trained specifically on 14 types of tool output from real SWE-bench workflows.
- 2B parameters, runs on a single GPU, serves via vLLM
- Outperforms Qwen 3.5 35B A3B zero-shot by +13% Span F1
- Returns verbatim lines only, no rewriting or summarization
- Works as a CLI pipe, Python library, or vLLM server
## Evaluation
Evaluated on 617 held-out test samples from SWE-bench repositories, across 14 tool types:
| Model | Precision | Recall | F1 | Compression |
|---|---|---|---|---|
| Squeez-2B | 0.8043 | 0.8624 | 0.7895 | 0.9150 |
| Qwen 3.5 35B A3B (zero-shot) | 0.7402 | 0.7498 | 0.7000 | 0.9177 |
| Kimi K2 (zero-shot) | 0.6128 | 0.5286 | 0.5344 | 0.9425 |
| Qwen 3.5 2B (untrained) | 0.4154 | 0.5299 | 0.4075 | 0.8197 |
| BM25 (10%) | 0.1277 | 0.2172 | 0.1314 | 0.9036 |
| First-N (10%) | 0.0741 | 0.1445 | 0.0798 | 0.9055 |
| Random (10%) | 0.0738 | 0.1009 | 0.0697 | 0.9067 |
| Last-N (10%) | 0.0496 | 0.0503 | 0.0407 | 0.9130 |
Span-level precision, recall, and F1 measure strict line-level set overlap between predicted and gold relevant lines. Compression is the fraction of input removed.
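The definitions above can be made concrete with a short sketch (illustrative only, not the actual evaluation code):

```python
def span_metrics(predicted_lines, gold_lines, input_lines):
    """Strict line-level set overlap between predicted and gold relevant lines."""
    pred, gold = set(predicted_lines), set(gold_lines)
    tp = len(pred & gold)                                # lines kept correctly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    compression = 1 - len(pred) / len(input_lines)       # fraction of input removed
    return precision, recall, f1, compression
```

For example, keeping one correct line and one spurious line out of a 10-line input gives precision 0.5, recall 1.0, and compression 0.8.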
## Quick Start

### With vLLM (recommended)

```bash
# Start the server
pip install vllm
vllm serve KRLabsOrg/squeez-2b --dtype bfloat16 --max-model-len 16384
```

```bash
# Use from the squeez CLI
pip install squeez
export SQUEEZ_SERVER_URL=http://localhost:8000/v1
cat output.txt | squeez "find the bug"

# Or pipe directly
python -m pytest tests/ -v 2>&1 | squeez "find the test failure related to authentication"
```
vLLM provides continuous batching and high throughput — ideal when multiple agents or tools are running concurrently.
### With squeez (local, no server)

```bash
pip install squeez

# Downloads and runs the model locally (no GPU server needed)
squeez "Find the failing traceback block" --input-file output.txt
```
Note: Local mode loads the model on every call. Fine for one-off use, but for repeated calls (e.g. an agent piping every tool through squeez), use vLLM — the model stays warm in memory.
### With transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "KRLabsOrg/squeez-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": (
        "You prune verbose tool output for a coding agent. "
        "Given a focused extraction query and one tool output, return only the "
        "smallest verbatim evidence block(s) the agent should read next. "
        "Return the kept text inside <relevant_lines> tags. "
        "Do not rewrite, summarize, or invent lines."
    )},
    {"role": "user", "content": (
        "<query>\nFix the failing authentication test\n</query>\n"
        "<tool_output>\n"
        "PASSED tests/test_login.py::test_valid_credentials\n"
        "FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401\n"
        "PASSED tests/test_login.py::test_logout\n"
        "PASSED tests/test_login.py::test_rate_limiting\n"
        "\n</tool_output>"
    )},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
Output:

```
<relevant_lines>
FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401
</relevant_lines>
```
### Python API (with squeez)
```python
from squeez.inference.extractor import ToolOutputExtractor

# Loads this model locally
extractor = ToolOutputExtractor(model_path="KRLabsOrg/squeez-2b")

# Or connect to a vLLM server
extractor = ToolOutputExtractor(base_url="http://localhost:8000/v1")

filtered = extractor.extract(
    task="Find the referer validation block",
    tool_output=raw_output,
)
print(filtered)
```
## Input / Output Format
Input — chat format with system prompt:

```
System: You prune verbose tool output for a coding agent. Given a focused
extraction query and one tool output, return only the smallest verbatim
evidence block(s) the agent should read next. Return the kept text inside
<relevant_lines> tags. Do not rewrite, summarize, or invent lines.

User: <query>{task_description}</query>
<tool_output>{raw_tool_output}</tool_output>
```
Output — verbatim relevant lines wrapped in XML:

```
<relevant_lines>
{only the lines that matter, copied verbatim}
</relevant_lines>
```

If no lines are relevant, the model returns empty tags: `<relevant_lines>\n</relevant_lines>`.
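Downstream code can recover the kept lines with a few lines of parsing. This `parse_relevant` helper is a sketch, not part of the squeez package:

```python
import re

def parse_relevant(response: str) -> list[str]:
    """Pull the verbatim kept lines out of a <relevant_lines> block."""
    m = re.search(r"<relevant_lines>\n?(.*?)</relevant_lines>", response, re.DOTALL)
    if m is None:
        return []  # malformed response: no tags found
    body = m.group(1).strip("\n")
    return body.splitlines() if body else []  # empty tags -> no relevant lines
```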
## Supported Tool Types
The model was trained on 14 tool types from SWE-bench repositories:
| Tool type | Description | Example |
|---|---|---|
| `test_output` | pytest / unittest output | Test failures, tracebacks, assertion errors |
| `read_file` | File contents | Source code, config files |
| `grep` | Search results | Pattern matches across files |
| `git_diff` | Code changes | Diffs between commits or branches |
| `git_log` | Commit history | Relevant commits |
| `git_blame` | Line-level attribution | Who changed what |
| `ls` | Directory listings | File structure |
| `python` | Python REPL output | Script output, errors |
| `curl` | HTTP responses | API responses, documentation |
| `build_output` | Build logs | Compilation errors, warnings |
| `lint_output` | Linter output | Style/type violations |
| `pip_install` | Package manager output | Dependency errors |
| `type_check` | Type checker output | mypy/pyright errors |
| `coverage` | Coverage reports | Uncovered lines |
## Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-2B |
| Fine-tuning method | LoRA (r=16, alpha=32) via Unsloth |
| Training data | Squeez v3 — 10,508 samples from SWE-bench |
| Epochs | 3 (best checkpoint at epoch 1.5) |
| Max sequence length | 16,384 tokens |
| Learning rate | 2e-4 |
| Batch size | 8 (effective 32 with 4x gradient accumulation) |
| Warmup | 5% of steps |
| Weight decay | 0.01 |
| Checkpoint selection | Best validation Span F1 |
### Data generation
Training data was generated by running 14 types of tool calls on SWE-bench repositories and using a teacher model to label the relevant lines. Each sample contains:
- A focused extraction query (what the agent needs to find)
- Raw tool output (as the agent would see it)
- Gold relevant lines (the minimal set the agent should read)
Dataset: KRLabsOrg/tool-output-extraction-swebench
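A single sample can be pictured roughly like this (field names are illustrative; check the dataset card for the exact schema):

```python
sample = {
    "query": "Fix the failing authentication test",   # focused extraction query
    "tool_type": "test_output",                       # one of the 14 tool types
    "tool_output": (                                  # raw output, as the agent sees it
        "PASSED tests/test_login.py::test_valid_credentials\n"
        "FAILED tests/test_login.py::test_token_refresh - AssertionError\n"
    ),
    "relevant_lines": [                               # minimal gold set to keep
        "FAILED tests/test_login.py::test_token_refresh - AssertionError",
    ],
}
```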
## Limitations
- Trained primarily on Python/SWE-bench data — works best on software engineering tool output, though the prompt format generalizes to other domains
- Not designed for general-purpose text summarization or question answering
- Very short outputs (<5 lines) may be returned unchanged
- Max input length is 16,384 tokens — longer outputs should be chunked
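For outputs past the context limit, a simple workaround (a sketch, not a built-in squeez feature) is to split on line boundaries, prune each chunk with the same query, and concatenate the kept lines. Character counts stand in for token counts here; a real implementation would count tokens with the tokenizer:

```python
def chunk_lines(text: str, max_chars: int = 40_000) -> list[str]:
    """Split tool output into line-aligned chunks of at most max_chars each."""
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))  # flush the full chunk
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks

# Prune each chunk independently, then stitch the kept lines back together:
# kept = "\n".join(extractor.extract(task=query, tool_output=c) for c in chunk_lines(raw))
```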
## Use with coding agents
Add to your agent's system instructions (e.g. CLAUDE.md for Claude Code):
```
Always pipe shell commands through squeez and tell it exactly what you want to know.

Examples:
- `bun test 2>&1 | squeez "did the tests pass?"`
- `git log --oneline -50 | squeez "find the commit that broke CSRF"`
- `cat src/auth/middleware.py | squeez "find the referer validation logic"`

Do NOT use squeez when:
- You need exact, uncompressed output (e.g. writing a patch)
- The command is interactive
```
## Citation

```bibtex
@software{kovacs2026squeez,
  title={Squeez: Compressing Tool Output for LLM Coding Agents},
  author={Adam Kovacs},
  year={2026},
  url={https://github.com/KRLabsOrg/squeez}
}
```
## License
Apache 2.0
## Acknowledgments
- Qwen for the Qwen 3.5 2B base model
- Unsloth for efficient LoRA training
- SWE-bench for the evaluation framework and source repositories
- Provence and SWE-Pruner for inspiration on context pruning approaches