Kompress: ModernBERT Token Compressor for LLM Context Windows

Kompress compresses text in LLM context windows so agents can do more with less. It's a drop-in replacement for LLMLingua-2 that's higher quality and 2.3x faster.

Results

Model           Quality   Latency          Size    Params
kompress-base   7.9/10    84ms (MPS)       600MB   150M
kompress-small  7.4/10    13-29ms (ONNX)   279MB   70M
LLMLingua-2     5.9/10    117ms            710MB   179M

Quality on Real Agent Data (Claude-judged)

Eval Set              kompress-base   kompress-small   LLMLingua-2
Unstructured NL text  7.9/10          7.4/10           5.9/10
Claude Code sessions  7.3/10          7.4/10           6.2/10

Quality scores are judged by Claude Sonnet 4.6: "Can an LLM fully understand and act on the compressed version?" (1-10 scale).

How It Works

Kompress is a dual-head ModernBERT model trained to classify each token as keep or discard:

  • Token head: Binary classifier (keep/discard per token via argmax)
  • Span head: 1D CNN that identifies important regions, boosts borderline tokens in critical spans

The model decides how much to compress based on content density — no fixed compression ratio.
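The decision rule described above can be sketched as follows. This is an illustrative reconstruction, not the actual Kompress implementation: the function name, the borderline window, and the span threshold of 0.7 are our assumptions.

```python
import numpy as np

def select_tokens(token_logits, span_scores, span_threshold=0.7):
    """Combine the token head and span head into a keep mask (hypothetical sketch).

    token_logits: (seq, 2) per-token [discard, keep] logits from the token head.
    span_scores:  (seq,) per-token importance in [0, 1] from the span head.
    """
    keep = token_logits.argmax(axis=-1) == 1  # token head decision via argmax
    # Softmax to find borderline tokens near the 0.5 decision boundary.
    probs = np.exp(token_logits) / np.exp(token_logits).sum(-1, keepdims=True)
    borderline = np.abs(probs[:, 1] - 0.5) < 0.1
    # Span head rescues borderline tokens that sit inside an important span.
    boosted = borderline & (span_scores > span_threshold)
    return keep | boosted

tokens = ["After", "investigating", "the", "memory", "leak"]
logits = np.array([[1.0, -1.0], [0.1, -0.1], [2.0, -2.0], [-3.0, 3.0], [-3.0, 3.0]])
spans = np.array([0.1, 0.9, 0.2, 0.95, 0.95])
mask = select_tokens(logits, spans)
compressed = [t for t, m in zip(tokens, mask) if m]
# → ["investigating", "memory", "leak"]
```

Here "investigating" is discarded by the token head alone (argmax says discard) but is borderline and sits in a high-importance span, so the span head boosts it back in. Because nothing forces a fixed number of kept tokens, the keep ratio follows content density.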

Example

ORIGINAL (98 words):
After investigating the memory leak, I traced it to the event listener
registration in the WebSocket handler. Every time a client connects, we
register a new listener on the global event bus, but when the client
disconnects, the cleanup function only removes the WebSocket connection
from the pool — it doesn't unregister the event listener. Over time,
these orphaned listeners accumulate and each one holds a reference to
the connection's closure, which in turn holds the entire request context.
The fix is straightforward: store the listener reference at connection
time and explicitly remove it in the disconnect handler.

COMPRESSED (59 words, 60% kept):
investigating memory leak, traced event listener registration WebSocket
handler. Every time client connects, register new listener global event
bus, client disconnects, cleanup function only removes WebSocket
connection pool — doesn't unregister event listener. Over time, orphaned
listeners accumulate each one holds reference connection's closure, holds
entire request context. fix straightforward: store listener reference
connection time explicitly remove disconnect handler.

An LLM can fully understand and act on the compressed version.

Usage

from kompress.inference.pytorch_runner import KompressRunner

# Auto-downloads from HuggingFace on first use
runner = KompressRunner()

result = runner.compress("Your long text here...")
print(result.compressed)       # Compressed text
print(result.compression_ratio) # e.g., 0.62
print(result.tokens_saved)      # Number of tokens saved

With Headroom (LLM Proxy)

pip install headroom-ai

Kompress is built into Headroom as the default text compressor. It auto-downloads and runs on every API request that passes through the proxy.

Training

Architecture

  • Base: answerdotai/ModernBERT-base (149M params, 8192 token context)
  • Token head: Linear(768, 2) — binary keep/discard classifier
  • Span head: Conv1d(768→256, k=5) → GELU → Conv1d(256→1, k=3) → Sigmoid
  • Total: 150M params
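The two heads listed above can be sketched in PyTorch as small modules on top of ModernBERT's 768-dim hidden states. Class names, padding choices (to preserve sequence length), and the dummy input are our assumptions; this is not the Kompress source.

```python
import torch
import torch.nn as nn

class TokenHead(nn.Module):
    """Linear(768, 2): per-token keep/discard logits."""
    def __init__(self, hidden=768):
        super().__init__()
        self.classifier = nn.Linear(hidden, 2)

    def forward(self, h):            # h: (batch, seq, hidden)
        return self.classifier(h)    # (batch, seq, 2)

class SpanHead(nn.Module):
    """Conv1d(768→256, k=5) → GELU → Conv1d(256→1, k=3) → Sigmoid.

    Padding values are assumed so output length matches input length.
    """
    def __init__(self, hidden=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(hidden, 256, kernel_size=5, padding=2),
            nn.GELU(),
            nn.Conv1d(256, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, h):            # h: (batch, seq, hidden)
        # Conv1d expects (batch, channels, seq), so transpose in and out.
        return self.net(h.transpose(1, 2)).squeeze(1)  # (batch, seq) in [0, 1]

h = torch.randn(1, 16, 768)          # stand-in for ModernBERT encoder output
logits = TokenHead()(h)              # (1, 16, 2)
scores = SpanHead()(h)               # (1, 16)
```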

Data

215K extractive compression labels from 8 diverse datasets, labeled by Claude Sonnet 4.6:

Dataset        Count   Type
LMSYS-Chat-1M  57K     LLM conversations
CNN/DailyMail  50K     News articles
WikiHow        50K     How-to guides
MeetingBank    50K     Meeting transcripts
XSum           47K     News articles
GovReport      25K     Government reports
ArXiv          25K     Academic papers
SAMSum         14K     Dialogues

Labeling Approach

Key insight: the labels must be strictly extractive — a subset of original words in original order. Previous versions failed because the labeling LLM rephrased text, causing alignment failures (5-12% keep ratio instead of the intended 40-60%).

The fix: prompt Claude to "select words like highlighting with a marker" rather than "compress this text." This ensures every word in the compressed output exists in the original, and the greedy alignment recovers 95%+ of the intended labels.
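Because the labels are strictly extractive, alignment reduces to a single greedy pass over the original words, marking a word "keep" when it matches the next unconsumed word of the compressed output. A minimal sketch (function name and punctuation normalization are illustrative, not the actual pipeline):

```python
def align_labels(original_words, compressed_words):
    """Greedily align an extractive compression back onto the original text,
    producing a per-word keep (1) / discard (0) label sequence."""
    labels = []
    j = 0  # index of the next unmatched compressed word
    for w in original_words:
        if (j < len(compressed_words)
                and w.lower().strip(".,") == compressed_words[j].lower().strip(".,")):
            labels.append(1)  # keep: word survives into the compressed output
            j += 1
        else:
            labels.append(0)  # discard
    return labels

orig = "The fix is straightforward store the listener reference".split()
comp = "fix straightforward store listener reference".split()
align_labels(orig, comp)  # → [0, 1, 0, 1, 1, 0, 1, 1]
```

When the labeler rephrases instead of highlighting, the compressed words stop matching and this pass silently drops them, which is exactly the 5-12% keep-ratio failure mode described above.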

Training Details

  • 3 epochs, batch size 32, learning rate 2e-5
  • BF16 mixed precision on NVIDIA H100
  • HuggingFace Trainer with warmup + cosine schedule
  • ~3 hours training time
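The hyperparameters above map onto a HuggingFace Trainer configuration roughly as follows. This is a hedged sketch, not the actual training script: the output path is a placeholder and the warmup ratio is our assumption (the card only states "warmup + cosine schedule").

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="kompress-base-ckpt",   # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",        # cosine decay after warmup
    warmup_ratio=0.1,                  # warmup fraction: our assumption
    bf16=True,                         # BF16 mixed precision (H100)
)
```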

Model Family

                kompress-base         kompress-small                  LLMLingua-2
Architecture    ModernBERT 22-layer   ModernBERT 6-layer (distilled)  mBERT (2018)
Params          150M                  70M                             179M
Size            600MB                 279MB (ONNX: 275MB)             710MB
Max context     8,192 tokens          8,192 tokens                    512 tokens
Quality         7.9/10                7.4/10                          5.9/10
Latency         84ms (MPS)            13-29ms (ONNX)                  117ms
Training data   215K from 8 datasets  Distilled from base             41K from MeetingBank
Labeling model  Claude Sonnet 4.6     —                               GPT-4
Compression     Content-adaptive      Content-adaptive                Fixed ratio

Limitations

  • English only (ModernBERT is English-focused)
  • Best on natural language text; structured data (JSON, code, logs) should use specialized compressors
  • Compression ratio varies by content (60-80% kept for dense text, 40-60% for verbose text)

License

Apache 2.0
