Squeez
Squeeze out the juice, leave the pulp behind.

Squeez-2B

LLM coding agents spend 80-95% of their context window on irrelevant tool output. Squeez filters it down to the lines that actually matter, compressing tool output by ~91% while keeping 86% of the relevant information.

What is Squeez?

A tool output pruner for coding agents. When an agent runs a tool (pytest, grep, git log, npm build, kubectl, etc.), the output is often hundreds of lines but only a handful matter for the current task. Squeez sits between the tool and the agent's context window:

Tool output (500 lines) → Squeez → Relevant lines (30 lines) → Agent context

Existing context pruning tools (SWE-Pruner, Zilliz Semantic Highlight, Provence) are built for source code or document paragraphs. They don't handle the mixed, unstructured format of tool output (stack traces interleaved with passing tests, grep matches with context lines, build logs with timestamps).

This model is Qwen 3.5 2B fine-tuned to extract verbatim relevant lines from tool output given a task-specific query. It's trained specifically on 14 types of tool output from real SWE-bench workflows.

  • 2B parameters, runs on a single GPU, serves via vLLM
  • Outperforms Qwen 3.5 35B A3B zero-shot by +13% relative Span F1 (0.79 vs 0.70)
  • Returns verbatim lines only, no rewriting or summarization
  • Works as a CLI pipe, Python library, or vLLM server

Evaluation

Evaluated on 617 held-out test samples from SWE-bench repositories, across 14 tool types:

Model                         Precision   Recall    F1        Compression
Squeez-2B                     0.8043      0.8624    0.7895    0.9150
Qwen 3.5 35B A3B (zero-shot)  0.7402      0.7498    0.7000    0.9177
Kimi K2 (zero-shot)           0.6128      0.5286    0.5344    0.9425
Qwen 3.5 2B (untrained)       0.4154      0.5299    0.4075    0.8197
BM25 (10%)                    0.1277      0.2172    0.1314    0.9036
First-N (10%)                 0.0741      0.1445    0.0798    0.9055
Random (10%)                  0.0738      0.1009    0.0697    0.9067
Last-N (10%)                  0.0496      0.0503    0.0407    0.9130

Span-level precision, recall, and F1 measure strict line-level set overlap between predicted and gold relevant lines. Compression is the fraction of input removed.
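
To make the metric concrete, here is a minimal per-sample sketch of line-level set overlap (the exact evaluation script may normalize lines or average differently; the table presumably macro-averages per-sample scores over the 617 test samples):

def span_metrics(predicted: set[str], gold: set[str], total_lines: int):
    # Strict set overlap between predicted and gold relevant lines.
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    compression = 1 - len(predicted) / total_lines  # fraction of input removed
    return precision, recall, f1, compression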

Quick Start

With vLLM (recommended)

# Start the server
pip install vllm
vllm serve KRLabsOrg/squeez-2b --dtype bfloat16 --max-model-len 16384

# Use via the squeez CLI
pip install squeez
export SQUEEZ_SERVER_URL=http://localhost:8000/v1
cat output.txt | squeez "find the bug"

# Or pipe directly
python -m pytest tests/ -v 2>&1 | squeez "find the test failure related to authentication"

vLLM gives you continuous batching and high throughput, which is ideal when multiple agents or tools are running concurrently.
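
Because vLLM exposes an OpenAI-compatible API, you can also call the server directly, without the squeez CLI. A minimal sketch using the openai client (the system and user prompts follow the format documented under Input / Output Format below):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

system_prompt = (
    "You prune verbose tool output for a coding agent. "
    "Given a focused extraction query and one tool output, return only the "
    "smallest verbatim evidence block(s) the agent should read next. "
    "Return the kept text inside <relevant_lines> tags. "
    "Do not rewrite, summarize, or invent lines."
)

raw_output = open("output.txt").read()
response = client.chat.completions.create(
    model="KRLabsOrg/squeez-2b",
    temperature=0.1,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": (
            "<query>\nfind the bug\n</query>\n"
            f"<tool_output>\n{raw_output}\n</tool_output>"
        )},
    ],
)
print(response.choices[0].message.content)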

With squeez (local, no server)

pip install squeez

# Downloads and runs the model locally (no GPU server needed)
squeez "Find the failing traceback block" --input-file output.txt

Note: Local mode loads the model on every call. Fine for one-off use, but for repeated calls (e.g. an agent piping every tool through squeez), use vLLM — the model stays warm in memory.

With transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "KRLabsOrg/squeez-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": (
        "You prune verbose tool output for a coding agent. "
        "Given a focused extraction query and one tool output, return only the "
        "smallest verbatim evidence block(s) the agent should read next. "
        "Return the kept text inside <relevant_lines> tags. "
        "Do not rewrite, summarize, or invent lines."
    )},
    {"role": "user", "content": (
        "<query>\nFix the failing authentication test\n</query>\n"
        "<tool_output>\n"
        "PASSED tests/test_login.py::test_valid_credentials\n"
        "FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401\n"
        "PASSED tests/test_login.py::test_logout\n"
        "PASSED tests/test_login.py::test_rate_limiting\n"
        "\n</tool_output>"
    )},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)

response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)

Output:

<relevant_lines>
FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401
</relevant_lines>

Python API (with squeez)

from squeez.inference.extractor import ToolOutputExtractor

# Loads this model locally
extractor = ToolOutputExtractor(model_path="KRLabsOrg/squeez-2b")

# Or connect to a vLLM server
extractor = ToolOutputExtractor(base_url="http://localhost:8000/v1")

raw_output = open("output.txt").read()  # any verbose tool output

filtered = extractor.extract(
    task="Find the referer validation block",
    tool_output=raw_output,
)
print(filtered)

Input / Output Format

Input — chat format with system prompt:

System: You prune verbose tool output for a coding agent. Given a focused
extraction query and one tool output, return only the smallest verbatim
evidence block(s) the agent should read next. Return the kept text inside
<relevant_lines> tags. Do not rewrite, summarize, or invent lines.

User: <query>{task_description}</query>
<tool_output>{raw_tool_output}</tool_output>

Output — verbatim relevant lines wrapped in XML:

<relevant_lines>
{only the lines that matter, copied verbatim}
</relevant_lines>

If no lines are relevant, the model returns empty tags: <relevant_lines>\n</relevant_lines>.
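
On the consumer side, the kept text can be recovered with a small parser (a hypothetical helper, not part of the squeez API):

import re

def parse_relevant_lines(response: str) -> str:
    # Returns the verbatim kept text, or "" when the tags are empty or missing.
    match = re.search(r"<relevant_lines>\n?(.*?)</relevant_lines>", response, re.DOTALL)
    return match.group(1).rstrip("\n") if match else ""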

Supported Tool Types

The model was trained on 14 tool types from SWE-bench repositories:

Tool type      Description                Example
test_output    pytest / unittest output   Test failures, tracebacks, assertion errors
read_file      File contents              Source code, config files
grep           Search results             Pattern matches across files
git_diff       Code changes               Diffs between commits or branches
git_log        Commit history             Relevant commits
git_blame      Line-level attribution     Who changed what
ls             Directory listings         File structure
python         Python REPL output         Script output, errors
curl           HTTP responses             API responses, documentation
build_output   Build logs                 Compilation errors, warnings
lint_output    Linter output              Style/type violations
pip_install    Package manager output     Dependency errors
type_check     Type checker output        mypy/pyright errors
coverage       Coverage reports           Uncovered lines

Training Details

Parameter             Value
Base model            Qwen/Qwen3.5-2B
Fine-tuning method    LoRA (r=16, alpha=32) via Unsloth
Training data         Squeez v3: 10,508 samples from SWE-bench
Epochs                3 (best checkpoint at epoch 1.5)
Max sequence length   16,384 tokens
Learning rate         2e-4
Batch size            8 (effective 32 with 4x gradient accumulation)
Warmup                5% of steps
Weight decay          0.01
Checkpoint selection  Best validation Span F1
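
In peft terms, the adapter setup above corresponds roughly to the following sketch (the actual run used Unsloth; target_modules is an assumption, as the card does not list them):

from peft import LoraConfig

# Sketch of the stated LoRA hyperparameters (r=16, alpha=32).
# target_modules is an assumption, not taken from the released training config.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)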

Data generation

Training data was generated by running 14 types of tool calls on SWE-bench repositories and using a teacher model to label the relevant lines. Each sample contains:

  • A focused extraction query (what the agent needs to find)
  • Raw tool output (as the agent would see it)
  • Gold relevant lines (the minimal set the agent should read)

Dataset: KRLabsOrg/tool-output-extraction-swebench
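
For illustration, a single sample has roughly this shape (field names are illustrative, not necessarily the dataset's exact schema):

sample = {
    "query": "Fix the failing authentication test",   # focused extraction query
    "tool_type": "test_output",                       # one of the 14 tool types
    "tool_output": (                                  # raw output, as the agent would see it
        "PASSED tests/test_login.py::test_valid_credentials\n"
        "FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401\n"
        "PASSED tests/test_login.py::test_logout\n"
    ),
    "relevant_lines": [                               # gold: the minimal set to read
        "FAILED tests/test_login.py::test_token_refresh - AssertionError: expected 200 got 401",
    ],
}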

Limitations

  • Trained primarily on Python/SWE-bench data — works best on software engineering tool output, though the prompt format generalizes to other domains
  • Not designed for general-purpose text summarization or question answering
  • Very short outputs (<5 lines) may be returned unchanged
  • Max input length is 16,384 tokens; longer outputs should be chunked (see the sketch below)
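
A minimal line-based chunking sketch, assuming a rough four-characters-per-token budget (squeez itself may handle long inputs differently):

def chunk_lines(text: str, max_tokens: int = 16_384, chars_per_token: int = 4) -> list[str]:
    # Leave headroom for the query and prompt scaffolding.
    budget = (max_tokens - 1_024) * chars_per_token
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        if current and size + len(line) > budget:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks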

Use with coding agents

Add to your agent's system instructions (e.g. CLAUDE.md for Claude Code):

Always pipe shell commands through squeez and tell it exactly what you want to know.

Examples:
- `bun test 2>&1 | squeez "did the tests pass?"`
- `git log --oneline -50 | squeez "find the commit that broke CSRF"`
- `cat src/auth/middleware.py | squeez "find the referer validation logic"`

Do NOT use squeez when:
- You need exact, uncompressed output (e.g. writing a patch)
- The command is interactive

Citation

@software{kovacs2026squeez,
    title={Squeez: Compressing Tool Output for LLM Coding Agents},
    author={Adam Kovacs},
    year={2026},
    url={https://github.com/KRLabsOrg/squeez}
}

License

Apache 2.0

Acknowledgments

  • Qwen for the Qwen 3.5 2B base model
  • Unsloth for efficient LoRA training
  • SWE-bench for the evaluation framework and source repositories
  • Provence and SWE-Pruner for inspiration on context pruning approaches