# Squeez-2B

*Squeeze out the juice, leave the pulp behind.*
LLM coding agents spend 80-95% of their context window on irrelevant tool output. Squeez filters it down to the lines that actually matter, compressing tool output by ~91% while keeping 86% of the relevant information.
## What is Squeez?
A tool output pruner for coding agents. When an agent runs a tool (pytest, grep, git log, npm build, kubectl, etc.), the output is often hundreds of lines but only a handful matter for the current task. Squeez sits between the tool and the agent's context window:
```
Tool output (500 lines) → Squeez → Relevant lines (30 lines) → Agent context
```
Existing context pruning tools (SWE-Pruner, Zilliz Semantic Highlight, Provence) are built for source code or document paragraphs. They don't handle the mixed, unstructured format of tool output (stack traces interleaved with passing tests, grep matches with context lines, build logs with timestamps).
This model is Qwen 3.5 2B fine-tuned to extract verbatim relevant lines from tool output given a task-specific query. It's trained specifically on 14 types of tool output from real SWE-bench workflows.
- 2B parameters, runs on a single GPU, serves via vLLM
- Outperforms Qwen 3.5 35B A3B zero-shot by +13% Span F1
- Returns verbatim lines only, no rewriting or summarization
- Works as a CLI pipe, Python library, or vLLM server
## Evaluation
Evaluated on 617 held-out test samples from SWE-bench repositories, across 14 tool types:
| Model | Precision | Recall | F1 | Compression |
|---|---|---|---|---|
| Squeez-2B | 0.8043 | 0.8624 | 0.7895 | 0.9150 |
| Qwen 3.5 35B A3B (zero-shot) | 0.7402 | 0.7498 | 0.7000 | 0.9177 |
| Kimi K2 (zero-shot) | 0.6128 | 0.5286 | 0.5344 | 0.9425 |
| Qwen 3.5 2B (untrained) | 0.4154 | 0.5299 | 0.4075 | 0.8197 |
| BM25 (10%) | 0.1277 | 0.2172 | 0.1314 | 0.9036 |
| First-N (10%) | 0.0741 | 0.1445 | 0.0798 | 0.9055 |
| Random (10%) | 0.0738 | 0.1009 | 0.0697 | 0.9067 |
| Last-N (10%) | 0.0496 | 0.0503 | 0.0407 | 0.9130 |
Span-level precision, recall, and F1 measure strict line-level set overlap between predicted and gold relevant lines. Compression is the fraction of input removed.
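The definitions above can be made concrete with a short sketch (illustrative only, not the actual evaluation code):

```python
def span_metrics(predicted_lines, gold_lines, input_lines):
    """Strict line-level set overlap between predicted and gold relevant lines."""
    pred, gold = set(predicted_lines), set(gold_lines)
    tp = len(pred & gold)                                # lines kept correctly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    compression = 1 - len(pred) / len(input_lines)       # fraction of input removed
    return precision, recall, f1, compression
```

For example, keeping one correct line and one spurious line out of a 10-line input gives precision 0.5, recall 1.0, and compression 0.8.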
## Quick Start

### With vLLM (recommended)

```bash
# Start the server
pip install vllm
vllm serve KRLabsOrg/squeez-2b --dtype bfloat16 --max-model-len 16384
```

```bash
# Use from the squeez CLI
pip install squeez
export SQUEEZ_SERVER_URL=http://localhost:8000/v1
cat output.txt | squeez "find the bug"

# Or pipe directly
python -m pytest tests/ -v 2>&1 | squeez "find the test failure related to authentication"
```
vLLM provides continuous batching and high throughput — ideal when multiple agents or tools are running concurrently.
### With squeez (local, no server)

```bash
pip install squeez

# Downloads and runs the model locally (no GPU server needed)
squeez "Find the failing traceback block" --input-file output.txt
```
Note: Local mode loads the model on every call. Fine for one-off use, but for repeated calls (e.g. an agent piping every tool through squeez), use vLLM — the model stays warm in memory.
### With transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "KRLabsOrg/squeez-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": (
        "You prune verbose tool output for a coding agent. "
        "Given a focused extraction query and one tool output, return only the "
        "smallest verbatim evidence block(s) the agent should read next. "
        "Return the kept text inside <relevant_lines> tags. "
        "Do not rewrite, summarize, or invent lines."
    )},
    {"role": "user", "content": (
        "<query>\nFix the failing authentication test\n</query>\n"
        "<tool_output>\n"
        "PASSED tests/test_login.py::test_valid_credentials\n"
        "FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401\n"
        "PASSED tests/test_login.py::test_logout\n"
        "PASSED tests/test_login.py::test_rate_limiting\n"
        "\n</tool_output>"
    )},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
Output:

```
<relevant_lines>
FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401
</relevant_lines>
```
### Python API (with squeez)
```python
from squeez.inference.extractor import ToolOutputExtractor

# Loads this model locally
extractor = ToolOutputExtractor(model_path="KRLabsOrg/squeez-2b")

# Or connect to a vLLM server
extractor = ToolOutputExtractor(base_url="http://localhost:8000/v1")

filtered = extractor.extract(
    task="Find the referer validation block",
    tool_output=raw_output,
)
print(filtered)
```
## Input / Output Format
Input — chat format with system prompt:

```
System: You prune verbose tool output for a coding agent. Given a focused
extraction query and one tool output, return only the smallest verbatim
evidence block(s) the agent should read next. Return the kept text inside
<relevant_lines> tags. Do not rewrite, summarize, or invent lines.

User: <query>{task_description}</query>
<tool_output>{raw_tool_output}</tool_output>
```
Output — verbatim relevant lines wrapped in XML:

```
<relevant_lines>
{only the lines that matter, copied verbatim}
</relevant_lines>
```

If no lines are relevant, the model returns empty tags: `<relevant_lines>\n</relevant_lines>`.
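Downstream code can recover the kept lines with a few lines of parsing. This `parse_relevant` helper is a sketch, not part of the squeez package:

```python
import re

def parse_relevant(response: str) -> list[str]:
    """Pull the verbatim kept lines out of a <relevant_lines> block."""
    m = re.search(r"<relevant_lines>\n?(.*?)</relevant_lines>", response, re.DOTALL)
    if m is None:
        return []  # malformed response: no tags found
    body = m.group(1).strip("\n")
    return body.splitlines() if body else []  # empty tags -> no relevant lines
```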
## Supported Tool Types
The model was trained on 14 tool types from SWE-bench repositories:
| Tool type | Description | Example |
|---|---|---|
| `test_output` | pytest / unittest output | Test failures, tracebacks, assertion errors |
| `read_file` | File contents | Source code, config files |
| `grep` | Search results | Pattern matches across files |
| `git_diff` | Code changes | Diffs between commits or branches |
| `git_log` | Commit history | Relevant commits |
| `git_blame` | Line-level attribution | Who changed what |
| `ls` | Directory listings | File structure |
| `python` | Python REPL output | Script output, errors |
| `curl` | HTTP responses | API responses, documentation |
| `build_output` | Build logs | Compilation errors, warnings |
| `lint_output` | Linter output | Style/type violations |
| `pip_install` | Package manager output | Dependency errors |
| `type_check` | Type checker output | mypy/pyright errors |
| `coverage` | Coverage reports | Uncovered lines |
## Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3.5-2B |
| Fine-tuning method | LoRA (r=16, alpha=32) via Unsloth |
| Training data | Squeez v3 — 10,508 samples from SWE-bench |
| Epochs | 3 (best checkpoint at epoch 1.5) |
| Max sequence length | 16,384 tokens |
| Learning rate | 2e-4 |
| Batch size | 8 (effective 32 with 4x gradient accumulation) |
| Warmup | 5% of steps |
| Weight decay | 0.01 |
| Checkpoint selection | Best validation Span F1 |
### Data generation
Training data was generated by running 14 types of tool calls on SWE-bench repositories and using a teacher model to label the relevant lines. Each sample contains:
- A focused extraction query (what the agent needs to find)
- Raw tool output (as the agent would see it)
- Gold relevant lines (the minimal set the agent should read)
Dataset: KRLabsOrg/tool-output-extraction-swebench
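A single sample can be pictured roughly like this (field names are illustrative; check the dataset card for the exact schema):

```python
sample = {
    "query": "Fix the failing authentication test",   # focused extraction query
    "tool_type": "test_output",                       # one of the 14 tool types
    "tool_output": (                                  # raw output, as the agent sees it
        "PASSED tests/test_login.py::test_valid_credentials\n"
        "FAILED tests/test_login.py::test_token_refresh - AssertionError\n"
    ),
    "relevant_lines": [                               # minimal gold set to keep
        "FAILED tests/test_login.py::test_token_refresh - AssertionError",
    ],
}
```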
## Limitations
- Trained primarily on Python/SWE-bench data — works best on software engineering tool output, though the prompt format generalizes to other domains
- Not designed for general-purpose text summarization or question answering
- Very short outputs (<5 lines) may be returned unchanged
- Max input length is 16,384 tokens — longer outputs should be chunked
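For outputs past the context limit, a simple workaround (a sketch, not a built-in squeez feature) is to split on line boundaries, prune each chunk with the same query, and concatenate the kept lines. Character counts stand in for token counts here; a real implementation would count tokens with the tokenizer:

```python
def chunk_lines(text: str, max_chars: int = 40_000) -> list[str]:
    """Split tool output into line-aligned chunks of at most max_chars each."""
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))  # flush the full chunk
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks

# Prune each chunk independently, then stitch the kept lines back together:
# kept = "\n".join(extractor.extract(task=query, tool_output=c) for c in chunk_lines(raw))
```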
## Use with coding agents
Add to your agent's system instructions (e.g. CLAUDE.md for Claude Code):
```
Always pipe shell commands through squeez and tell it exactly what you want to know.

Examples:
- `bun test 2>&1 | squeez "did the tests pass?"`
- `git log --oneline -50 | squeez "find the commit that broke CSRF"`
- `cat src/auth/middleware.py | squeez "find the referer validation logic"`

Do NOT use squeez when:
- You need exact, uncompressed output (e.g. writing a patch)
- The command is interactive
```
## Citation

```bibtex
@software{kovacs2026squeez,
  title={Squeez: Compressing Tool Output for LLM Coding Agents},
  author={Adam Kovacs},
  year={2026},
  url={https://github.com/KRLabsOrg/squeez}
}
```
## License
Apache 2.0
## Acknowledgments
- Qwen for the Qwen 3.5 2B base model
- Unsloth for efficient LoRA training
- SWE-bench for the evaluation framework and source repositories
- Provence and SWE-Pruner for inspiration on context pruning approaches