# Kompress: ModernBERT Token Compressor for LLM Context Windows
Kompress compresses text in LLM context windows so agents can do more with less. It is a drop-in replacement for LLMLingua-2 with higher-quality output and 2.3x faster inference.
## Results
| Model | Quality | Latency | Size | Params |
|---|---|---|---|---|
| kompress-base | 7.9/10 | 84ms (MPS) | 600MB | 150M |
| kompress-small | 7.4/10 | 13-29ms (ONNX) | 279MB | 70M |
| LLMLingua-2 | 5.9/10 | 117ms | 710MB | 179M |
### Quality on Real Agent Data (Claude-judged)
| Eval Set | kompress-base | kompress-small | LLMLingua-2 |
|---|---|---|---|
| Unstructured NL text | 7.9/10 | 7.4/10 | 5.9/10 |
| Claude Code sessions | 7.3/10 | 7.4/10 | 6.2/10 |
Quality scores are judged by Claude Sonnet 4.6 on a 1-10 scale: "Can an LLM fully understand and act on the compressed version?"
## How It Works
Kompress is a dual-head ModernBERT model trained to classify each token as keep or discard:
- Token head: a binary classifier (keep/discard per token via argmax)
- Span head: a 1D CNN that identifies important regions and boosts borderline tokens inside critical spans
The model decides how much to compress based on content density — no fixed compression ratio.
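A minimal sketch of this keep/discard decision, assuming the token head emits per-token `[discard, keep]` logits and the span head emits per-token scores in [0, 1]; the function name, threshold, and boost value are illustrative, not the shipped implementation:

```python
import numpy as np

def select_tokens(token_logits, span_scores, span_threshold=0.5, boost=1.0):
    """Keep a token if the token head says keep, or if it is borderline
    but sits inside an important span (span score above threshold)."""
    token_logits = np.asarray(token_logits, dtype=float)  # shape (T, 2): [discard, keep]
    span_scores = np.asarray(span_scores, dtype=float)    # shape (T,)
    # Boost the keep logit for tokens inside critical spans, then decide by argmax.
    boosted = token_logits.copy()
    boosted[:, 1] += boost * (span_scores > span_threshold)
    return boosted.argmax(axis=1) == 1

# The borderline token (index 1) flips to "keep" because its span score is high.
logits = [[1.0, 0.0], [0.6, 0.4], [0.0, 2.0]]
spans = [0.1, 0.9, 0.8]
print(select_tokens(logits, spans).tolist())  # [False, True, True]
```

Because the decision is per token rather than a fixed budget, dense text naturally keeps more tokens than verbose text.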
### Example
ORIGINAL (98 words):

```
After investigating the memory leak, I traced it to the event listener
registration in the WebSocket handler. Every time a client connects, we
register a new listener on the global event bus, but when the client
disconnects, the cleanup function only removes the WebSocket connection
from the pool — it doesn't unregister the event listener. Over time,
these orphaned listeners accumulate and each one holds a reference to
the connection's closure, which in turn holds the entire request context.
The fix is straightforward: store the listener reference at connection
time and explicitly remove it in the disconnect handler.
```

COMPRESSED (59 words, 60% kept):

```
investigating memory leak, traced event listener registration WebSocket
handler. Every time client connects, register new listener global event
bus, client disconnects, cleanup function only removes WebSocket
connection pool — doesn't unregister event listener. Over time, orphaned
listeners accumulate each one holds reference connection's closure, holds
entire request context. fix straightforward: store listener reference
connection time explicitly remove disconnect handler.
```
An LLM can fully understand and act on the compressed version.
## Usage
```python
from kompress.inference.pytorch_runner import KompressRunner

# Auto-downloads from HuggingFace on first use
runner = KompressRunner()
result = runner.compress("Your long text here...")

print(result.compressed)         # Compressed text
print(result.compression_ratio)  # e.g., 0.62
print(result.tokens_saved)       # Number of tokens saved
```
### With Headroom (LLM Proxy)
```shell
pip install headroom-ai
```
Kompress is built into Headroom as the default text compressor. It auto-downloads and runs on every API request that passes through the proxy.
## Training
### Architecture
- Base: `answerdotai/ModernBERT-base` (149M params, 8,192-token context)
- Token head: `Linear(768, 2)`, a binary keep/discard classifier
- Span head: `Conv1d(768→256, k=5) → GELU → Conv1d(256→1, k=3) → Sigmoid`
- Total: 150M params
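The two heads can be sketched over ModernBERT's 768-dim hidden states as follows; the base encoder is omitted, and the padding choices (to preserve sequence length) and class names are assumptions, not the released code:

```python
import torch
import torch.nn as nn

class KompressHeads(nn.Module):
    """Token head + span head over encoder hidden states of shape (B, T, 768)."""
    def __init__(self, hidden=768):
        super().__init__()
        self.token_head = nn.Linear(hidden, 2)  # binary keep/discard logits
        self.span_head = nn.Sequential(         # 1D CNN over the token axis
            nn.Conv1d(hidden, 256, kernel_size=5, padding=2),
            nn.GELU(),
            nn.Conv1d(256, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),                       # span importance in [0, 1]
        )

    def forward(self, hidden_states):
        token_logits = self.token_head(hidden_states)                # (B, T, 2)
        span_scores = self.span_head(hidden_states.transpose(1, 2))  # (B, 1, T)
        return token_logits, span_scores.squeeze(1)                  # (B, T, 2), (B, T)

h = torch.randn(2, 16, 768)  # a stand-in for encoder output
logits, spans = KompressHeads()(h)
print(logits.shape, spans.shape)  # torch.Size([2, 16, 2]) torch.Size([2, 16])
```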
### Data
215K extractive compression labels from 8 diverse datasets, labeled by Claude Sonnet 4.6:
| Dataset | Count | Type |
|---|---|---|
| LMSYS-Chat-1M | 57K | LLM conversations |
| CNN/DailyMail | 50K | News articles |
| WikiHow | 50K | How-to guides |
| MeetingBank | 50K | Meeting transcripts |
| XSum | 47K | News articles |
| GovReport | 25K | Government reports |
| ArXiv | 25K | Academic papers |
| SAMSum | 14K | Dialogues |
### Labeling Approach
Key insight: the labels must be strictly extractive — a subset of original words in original order. Previous versions failed because the labeling LLM rephrased text, causing alignment failures (5-12% keep ratio instead of the intended 40-60%).
The fix: prompt Claude to "select words like highlighting with a marker" rather than "compress this text." This guarantees every word in the compressed output exists in the original, so greedy alignment recovers 95%+ of the intended labels.
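The greedy alignment step can be sketched as below; the case-insensitive matching and strict left-to-right scan are assumptions about the implementation, made to illustrate why extractive labels align while rephrased ones do not:

```python
def greedy_align(original_words, compressed_words):
    """Map each compressed word to the next unused occurrence in the original,
    scanning left to right; returns a keep (1) / discard (0) label per original word.
    Rephrased words never match, which is how non-extractive labels get lost."""
    labels = [0] * len(original_words)
    pos = 0
    for word in compressed_words:
        for i in range(pos, len(original_words)):
            if original_words[i].lower() == word.lower():
                labels[i] = 1
                pos = i + 1
                break  # unmatched words are simply skipped
    return labels

original = "The fix is straightforward: store the listener reference".split()
compressed = "fix straightforward: store listener reference".split()
print(greedy_align(original, compressed))  # [0, 1, 0, 1, 1, 0, 1, 1]
```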
### Training Details
- 3 epochs, batch size 32, learning rate 2e-5
- BF16 mixed precision on NVIDIA H100
- HuggingFace Trainer with warmup + cosine schedule
- ~3 hours training time
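Under those settings, the Trainer configuration might look like the config fragment below; only the bulleted hyperparameters come from the source, while `output_dir` and the warmup ratio are illustrative guesses:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="kompress-base",       # illustrative path, not confirmed
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    bf16=True,                        # BF16 mixed precision (H100)
    lr_scheduler_type="cosine",       # warmup + cosine schedule
    warmup_ratio=0.05,                # assumed warmup fraction
)
```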
## Model Family
| | kompress-base | kompress-small | LLMLingua-2 |
|---|---|---|---|
| Architecture | ModernBERT 22-layer | ModernBERT 6-layer (distilled) | mBERT (2018) |
| Params | 150M | 70M | 179M |
| Size | 600MB | 279MB (ONNX: 275MB) | 710MB |
| Max context | 8,192 tokens | 8,192 tokens | 512 tokens |
| Quality | 7.9/10 | 7.4/10 | 5.9/10 |
| Latency | 84ms (MPS) | 13-29ms (ONNX) | 117ms |
| Training data | 215K from 8 datasets | Distilled from base | 41K from MeetingBank |
| Labeling model | Claude Sonnet 4.6 | — | GPT-4 |
| Compression | Content-adaptive | Content-adaptive | Fixed ratio |
## Limitations
- English only (ModernBERT is English-focused)
- Best on natural language text; structured data (JSON, code, logs) should use specialized compressors
- Compression ratio varies by content (60-80% kept for dense text, 40-60% for verbose text)
## License
Apache 2.0