APAC — Adaptive Paged Attention Controller for LLM Serving

A runtime control plane that sits above paged KV memory and continuously adjusts allocation, retention, offload, and routing policy according to observed request patterns.

Key Findings (Latest Results)

1. LRU Eviction is Near-Optimal for Single-Pool Prefix Sharing

Running Bélády (hindsight-optimal) eviction against LRU under memory pressure:

Oracle gap: 0% across throughput, TTFT, and cache hit rate
Implication: APAC's value cannot come from better eviction alone
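
The comparison above can be illustrated with a minimal sketch that counts cache hits for LRU and for Bélády eviction over a flat sequence of block accesses. This is not the project's oracle code: the function names (lru_hits, belady_hits, oracle_gap) and the flat-trace input are assumptions made for illustration, and the gap is measured on hit counts only, not on throughput or TTFT.

```python
from collections import OrderedDict

INF = float("inf")


def lru_hits(trace, capacity):
    """Count cache hits when the least recently used block is evicted."""
    cache = OrderedDict()
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return hits


def belady_hits(trace, capacity):
    """Count cache hits when the block reused farthest in the future is evicted."""
    # next_use[i] is the index of the next access to trace[i], or INF if none.
    next_use = [INF] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], INF)
        last_seen[trace[i]] = i

    cache = {}  # block -> index of its next access
    hits = 0
    for i, block in enumerate(trace):
        if block in cache:
            hits += 1
        elif len(cache) >= capacity:
            victim = max(cache, key=cache.get)  # farthest next use
            del cache[victim]
        cache[block] = next_use[i]
    return hits


def oracle_gap(trace, capacity):
    """Fraction of the hindsight-optimal hits that LRU fails to achieve."""
    optimal = belady_hits(trace, capacity)
    if optimal == 0:
        return 0.0
    return (optimal - lru_hits(trace, capacity)) / optimal
```

A cyclic scan slightly larger than the cache is the classic case where LRU falls far behind the hindsight-optimal policy, so a 0% gap suggests the measured workload contains few such access patterns.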

2. Multi-Pool Routing Offers Dramatic Latency Improvements

Comparing routing policies under memory pressure (2 pools, 128 blocks each, 500 requests):

Policy	Throughput	p99 TTFT	Cache Hits
Oracle	89.7 rps	76 ms	11,468
Round-robin	87.5 rps	964 ms	61,226
Random	85.0 rps	1097 ms	67,687
Least-loaded	88.1 rps	1173 ms	38,525

Key result: Oracle routing reduces p99 TTFT by 92% relative to round-robin (964 ms → 76 ms)

The oracle routes each request to the pool with the best prefix cache coverage for that request's prefix. This maximizes cache reuse per-pool and dramatically reduces queueing delays.
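
A minimal sketch of coverage-based routing, under stated assumptions: each request is represented as an ordered list of prefix block hashes, each pool as the set of block hashes it currently caches, and ties are broken by the shorter queue. The names prefix_coverage and route_by_prefix_coverage, the data layout, and the tie-break rule are hypothetical and not taken from the APAC code. The oracle described above also has hindsight, whereas this sketch looks only at current cache contents.

```python
def prefix_coverage(request_blocks, cached_blocks):
    """Return how many leading blocks of the request are already cached.

    Counting stops at the first missing block, because a KV prefix is only
    reusable up to the first block that has to be recomputed.
    """
    covered = 0
    for block_hash in request_blocks:
        if block_hash not in cached_blocks:
            break
        covered += 1
    return covered


def route_by_prefix_coverage(request_blocks, pools, queue_depths):
    """Return the index of the pool with the longest cached prefix.

    pools: list of sets of cached block hashes, one set per pool.
    queue_depths: list of pending-request counts, one per pool.
    Ties on coverage go to the pool with the shorter queue (an assumption).
    """
    return max(
        range(len(pools)),
        key=lambda i: (prefix_coverage(request_blocks, pools[i]), -queue_depths[i]),
    )


# Usage: pool 1 holds one more leading block of the request than pool 0.
pools = [{"a", "b"}, {"a", "b", "c"}]
chosen = route_by_prefix_coverage(["a", "b", "c", "d"], pools, queue_depths=[0, 3])
```

Because the decision is made per request from the request's own prefix, requests that share a prefix are not forced onto a single pool the way a class-level rule would force them.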

3. Class-Based Routing Can Be Catastrophic

Routing all shared-prefix requests to a single "hot" pool:

p99 TTFT: 2278 ms (vs 964 ms for round-robin, roughly 2.4x worse)
Throughput drops 2% from overload on the hot pool

Implication: Intelligent routing requires per-request prefix analysis, not coarse class-based heuristics.

Project Structure

apac/
├── schemas/           # Event schemas (trace format, controller decisions)
├── collectors/        # Live trace collectors (vLLM KV events, metrics)
├── simulator/         # Discrete-event simulator for KV cache policies
│   ├── core/          # DES engine, block pool, scheduler
│   ├── workloads/     # Workload generators (BurstGPT, ShareGPT, synthetic)
│   └── replay/        # Trace replay engine
├── controller/        # APAC controller (policy engine, forecasters, routing)
├── oracle/            # Hindsight-optimal oracle for gap analysis
├── visualiser/        # Block heatmaps, timeline plots, residency histograms
├── experiments/       # Experiment configs and scripts
└── tests/             # Unit and integration tests

Quick Start

pip install -e .
# Run oracle gap analysis on BurstGPT traces
python -m apac.experiments.oracle_gap --workload burstgpt --block-size 16 --num-blocks 2048

Key Concepts

Trace Collector: Subscribes to vLLM KV events via ZMQ and reconstructs block-level state
Simulator: Replays request traces through a paged block allocator with configurable policies
Oracle: Computes Bélády-optimal (hindsight) decisions for routing, retention, and offload
Controller: Online APAC that makes decisions based on observable signals only
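
To make the Simulator concept concrete, here is a minimal sketch of replaying request events through a paged block allocator. It is not the project's simulator: the BlockPool class, the replay function, the event tuple layout, and FIFO admission are all assumptions chosen for illustration, and prefix sharing and eviction are deliberately left out.

```python
import math
from collections import deque


class BlockPool:
    """A fixed-size pool of KV blocks handed out whole to requests."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = deque(range(num_blocks))
        self.owned = {}  # request id -> list of block ids

    def blocks_needed(self, num_tokens):
        # A partially filled block still occupies a whole block.
        return math.ceil(num_tokens / self.block_size)

    def allocate(self, req_id, num_tokens):
        """Reserve blocks for a request; return False if the pool is too full."""
        need = self.blocks_needed(num_tokens)
        if need > len(self.free):
            return False
        self.owned[req_id] = [self.free.popleft() for _ in range(need)]
        return True

    def release(self, req_id):
        """Return a request's blocks to the free list (no-op if it holds none)."""
        self.free.extend(self.owned.pop(req_id, []))


def replay(events, pool):
    """Replay time-ordered events and report which requests were admitted.

    events: tuples of ("arrive", req_id, num_tokens) or ("finish", req_id, 0).
    Returns (admitted request ids in order, request ids still waiting).
    """
    waiting = deque()
    admitted = []
    for kind, req_id, num_tokens in events:
        if kind == "arrive":
            waiting.append((req_id, num_tokens))
        else:
            pool.release(req_id)
        # FIFO admission: the head of the queue blocks everything behind it.
        while waiting and pool.allocate(*waiting[0]):
            admitted.append(waiting.popleft()[0])
    return admitted, [rid for rid, _ in waiting]
```

With 4 blocks of 16 tokens, a 40-token request takes 3 blocks, so a following 20-token request (2 blocks) has to wait until the first one finishes. This kind of queueing under memory pressure is the mechanism by which allocation and routing policy affect TTFT.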

Datasets Used

BurstGPT — 5.29M rows, Azure OpenAI production traces
WildChat-1M — 1M timestamped conversations
ShareGPT52K — Length distribution reference

References

PagedAttention (vLLM): arxiv:2309.06180
BurstGPT: arxiv:2401.17644
Splitwise: arxiv:2311.18677
Ada-KV: arxiv:2407.11550
LMCache: arxiv:2510.09665
SGLang RadixAttention: arxiv:2312.07104
