Kompress: ModernBERT Token Compressor for LLM Context Windows

Kompress compresses text in LLM context windows so agents can do more with less. It's a drop-in replacement for LLMLingua-2 that's higher quality and 2.3x faster.

Results

Model           Quality   Latency          Size    Params
kompress-base   7.9/10    84ms (MPS)       600MB   150M
kompress-small  7.4/10    13-29ms (ONNX)   279MB   70M
LLMLingua-2     5.9/10    117ms            710MB   179M

Quality on Real Agent Data (Claude-judged)

Eval Set              kompress-base   kompress-small   LLMLingua-2
Unstructured NL text  7.9/10          7.4/10           5.9/10
Claude Code sessions  7.3/10          7.4/10           6.2/10

Quality scores are judged by Claude Sonnet 4.6: "Can an LLM fully understand and act on the compressed version?" (1-10 scale).

How It Works

Kompress is a dual-head ModernBERT model trained to classify each token as keep or discard:

  • Token head: Binary classifier (keep/discard per token via argmax)
  • Span head: 1D CNN that identifies important regions, boosts borderline tokens in critical spans

The model decides how much to compress based on content density — no fixed compression ratio.
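The decision rule described above can be sketched as follows. This is an illustrative reconstruction, not the actual Kompress implementation: the function name, the borderline window, and the span threshold of 0.7 are our assumptions.

```python
import numpy as np

def select_tokens(token_logits, span_scores, span_threshold=0.7):
    """Combine the token head and span head into a keep mask (hypothetical sketch).

    token_logits: (seq, 2) per-token [discard, keep] logits from the token head.
    span_scores:  (seq,) per-token importance in [0, 1] from the span head.
    """
    keep = token_logits.argmax(axis=-1) == 1  # token head decision via argmax
    # Softmax to find borderline tokens near the 0.5 decision boundary.
    probs = np.exp(token_logits) / np.exp(token_logits).sum(-1, keepdims=True)
    borderline = np.abs(probs[:, 1] - 0.5) < 0.1
    # Span head rescues borderline tokens that sit inside an important span.
    boosted = borderline & (span_scores > span_threshold)
    return keep | boosted

tokens = ["After", "investigating", "the", "memory", "leak"]
logits = np.array([[1.0, -1.0], [0.1, -0.1], [2.0, -2.0], [-3.0, 3.0], [-3.0, 3.0]])
spans = np.array([0.1, 0.9, 0.2, 0.95, 0.95])
mask = select_tokens(logits, spans)
compressed = [t for t, m in zip(tokens, mask) if m]
# → ["investigating", "memory", "leak"]
```

Here "investigating" is discarded by the token head alone (argmax says discard) but is borderline and sits in a high-importance span, so the span head boosts it back in. Because nothing forces a fixed number of kept tokens, the keep ratio follows content density.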

Example

ORIGINAL (98 words):
After investigating the memory leak, I traced it to the event listener
registration in the WebSocket handler. Every time a client connects, we
register a new listener on the global event bus, but when the client
disconnects, the cleanup function only removes the WebSocket connection
from the pool — it doesn't unregister the event listener. Over time,
these orphaned listeners accumulate and each one holds a reference to
the connection's closure, which in turn holds the entire request context.
The fix is straightforward: store the listener reference at connection
time and explicitly remove it in the disconnect handler.

COMPRESSED (59 words, 60% kept):
investigating memory leak, traced event listener registration WebSocket
handler. Every time client connects, register new listener global event
bus, client disconnects, cleanup function only removes WebSocket
connection pool — doesn't unregister event listener. Over time, orphaned
listeners accumulate each one holds reference connection's closure, holds
entire request context. fix straightforward: store listener reference
connection time explicitly remove disconnect handler.

An LLM can fully understand and act on the compressed version.

Usage

from kompress.inference.pytorch_runner import KompressRunner

# Auto-downloads from HuggingFace on first use
runner = KompressRunner()

result = runner.compress("Your long text here...")
print(result.compressed)       # Compressed text
print(result.compression_ratio) # e.g., 0.62
print(result.tokens_saved)      # Number of tokens saved

With Headroom (LLM Proxy)

pip install headroom-ai

Kompress is built into Headroom as the default text compressor. It auto-downloads and runs on every API request that passes through the proxy.

Training

Architecture

  • Base: answerdotai/ModernBERT-base (149M params, 8192 token context)
  • Token head: Linear(768, 2) — binary keep/discard classifier
  • Span head: Conv1d(768→256, k=5) → GELU → Conv1d(256→1, k=3) → Sigmoid
  • Total: 150M params
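The two heads listed above can be sketched in PyTorch as small modules on top of ModernBERT's 768-dim hidden states. Class names, padding choices (to preserve sequence length), and the dummy input are our assumptions; this is not the Kompress source.

```python
import torch
import torch.nn as nn

class TokenHead(nn.Module):
    """Linear(768, 2): per-token keep/discard logits."""
    def __init__(self, hidden=768):
        super().__init__()
        self.classifier = nn.Linear(hidden, 2)

    def forward(self, h):            # h: (batch, seq, hidden)
        return self.classifier(h)    # (batch, seq, 2)

class SpanHead(nn.Module):
    """Conv1d(768→256, k=5) → GELU → Conv1d(256→1, k=3) → Sigmoid.

    Padding values are assumed so output length matches input length.
    """
    def __init__(self, hidden=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(hidden, 256, kernel_size=5, padding=2),
            nn.GELU(),
            nn.Conv1d(256, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, h):            # h: (batch, seq, hidden)
        # Conv1d expects (batch, channels, seq), so transpose in and out.
        return self.net(h.transpose(1, 2)).squeeze(1)  # (batch, seq) in [0, 1]

h = torch.randn(1, 16, 768)          # stand-in for ModernBERT encoder output
logits = TokenHead()(h)              # (1, 16, 2)
scores = SpanHead()(h)               # (1, 16)
```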

Data

215K extractive compression labels from 8 diverse datasets, labeled by Claude Sonnet 4.6:

Dataset        Count   Type
LMSYS-Chat-1M  57K     LLM conversations
CNN/DailyMail  50K     News articles
WikiHow        50K     How-to guides
MeetingBank    50K     Meeting transcripts
XSum           47K     News articles
GovReport      25K     Government reports
ArXiv          25K     Academic papers
SAMSum         14K     Dialogues

Labeling Approach

Key insight: the labels must be strictly extractive — a subset of original words in original order. Previous versions failed because the labeling LLM rephrased text, causing alignment failures (5-12% keep ratio instead of the intended 40-60%).

The fix: prompt Claude to "select words like highlighting with a marker" rather than "compress this text." This ensures every word in the compressed output exists in the original, and the greedy alignment recovers 95%+ of the intended labels.
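Because the labels are strictly extractive, alignment reduces to a single greedy pass over the original words, marking a word "keep" when it matches the next unconsumed word of the compressed output. A minimal sketch (function name and punctuation normalization are illustrative, not the actual pipeline):

```python
def align_labels(original_words, compressed_words):
    """Greedily align an extractive compression back onto the original text,
    producing a per-word keep (1) / discard (0) label sequence."""
    labels = []
    j = 0  # index of the next unmatched compressed word
    for w in original_words:
        if (j < len(compressed_words)
                and w.lower().strip(".,") == compressed_words[j].lower().strip(".,")):
            labels.append(1)  # keep: word survives into the compressed output
            j += 1
        else:
            labels.append(0)  # discard
    return labels

orig = "The fix is straightforward store the listener reference".split()
comp = "fix straightforward store listener reference".split()
align_labels(orig, comp)  # → [0, 1, 0, 1, 1, 0, 1, 1]
```

When the labeler rephrases instead of highlighting, the compressed words stop matching and this pass silently drops them, which is exactly the 5-12% keep-ratio failure mode described above.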

Training Details

  • 3 epochs, batch size 32, learning rate 2e-5
  • BF16 mixed precision on NVIDIA H100
  • HuggingFace Trainer with warmup + cosine schedule
  • ~3 hours training time
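The hyperparameters above map onto a HuggingFace Trainer configuration roughly as follows. This is a hedged sketch, not the actual training script: the output path is a placeholder and the warmup ratio is our assumption (the card only states "warmup + cosine schedule").

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="kompress-base-ckpt",   # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",        # cosine decay after warmup
    warmup_ratio=0.1,                  # warmup fraction: our assumption
    bf16=True,                         # BF16 mixed precision (H100)
)
```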

Model Family

                kompress-base         kompress-small                  LLMLingua-2
Architecture    ModernBERT 22-layer   ModernBERT 6-layer (distilled)  mBERT (2018)
Params          150M                  70M                             179M
Size            600MB                 279MB (ONNX: 275MB)             710MB
Max context     8,192 tokens          8,192 tokens                    512 tokens
Quality         7.9/10                7.4/10                          5.9/10
Latency         84ms (MPS)            13-29ms (ONNX)                  117ms
Training data   215K from 8 datasets  Distilled from base             41K from MeetingBank
Labeling model  Claude Sonnet 4.6     —                               GPT-4
Compression     Content-adaptive      Content-adaptive                Fixed ratio

Limitations

  • English only (ModernBERT is English-focused)
  • Best on natural language text; structured data (JSON, code, logs) should use specialized compressors
  • Compression ratio varies by content (60-80% kept for dense text, 40-60% for verbose text)

License

Apache 2.0
