APAC: Adaptive Paged Attention Controller for LLM Serving
A runtime control plane that sits above paged KV memory and continuously adjusts allocation, retention, offload, and routing policy according to observed request patterns.
Key Findings (Latest Results)
1. LRU Eviction is Near-Optimal for Single-Pool Prefix Sharing
Running Bélády (hindsight-optimal) eviction against LRU under memory pressure:
- Oracle gap: 0% across throughput, TTFT, and cache hit rate
- Implication: APAC's value cannot come from better eviction alone
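The oracle gap above can be illustrated with a minimal sketch that replays one block-access trace under both policies and compares hit counts. The function names (`lru_hits`, `belady_hits`, `oracle_gap`) and the flat list-of-block-ids trace format are assumptions made for illustration, not the APAC simulator's actual interface.

```python
from collections import OrderedDict

def lru_hits(trace, capacity):
    """Count cache hits for LRU eviction over a sequence of block ids."""
    cache = OrderedDict()
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return hits

def belady_hits(trace, capacity):
    """Count hits when evicting the block whose next use lies farthest ahead."""
    # For each position, precompute the index of the next access to that block.
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # block id -> index of its next access
    hits = 0
    for i, block in enumerate(trace):
        if block in cache:
            hits += 1
        elif len(cache) >= capacity:
            victim = max(cache, key=cache.get)  # farthest next use
            del cache[victim]
        cache[block] = next_use[i]
    return hits

def oracle_gap(trace, capacity):
    """Fraction of the oracle's hits that LRU fails to achieve."""
    opt = belady_hits(trace, capacity)
    lru = lru_hits(trace, capacity)
    return 0.0 if opt == 0 else (opt - lru) / opt
```

On a trace where one hot block is re-read between cold blocks, such as `[0, 1, 0, 2, 0, 3]` with capacity 2, both policies keep the hot block resident and the gap is zero. A cyclic scan such as `[1, 2, 3, 1, 2, 3]` is the classic case where LRU falls behind.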
2. Multi-Pool Routing Offers Dramatic Latency Improvements
Comparing routing policies under memory pressure (2 pools, 128 blocks each, 500 requests):
| Policy | Throughput | p99 TTFT | Cache Hits |
|---|---|---|---|
| Oracle | 89.7 rps | 76 ms | 11,468 |
| Round-robin | 87.5 rps | 964 ms | 61,226 |
| Random | 85.0 rps | 1097 ms | 67,687 |
| Least-loaded | 88.1 rps | 1173 ms | 38,525 |
Key result: Oracle routing reduces p99 TTFT by 92% relative to round-robin (964 ms → 76 ms)
The oracle routes each request to the pool with the best prefix cache coverage for that request's prefix. This maximizes cache reuse per-pool and dramatically reduces queueing delays.
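A minimal sketch of coverage-based routing, assuming each pool is represented as a set of resident block hashes and each request as an ordered list of block hashes for its prompt. The names `prefix_coverage` and `route_by_coverage` are hypothetical, and this greedy version sees only current pool contents, whereas the oracle in the table has hindsight.

```python
def prefix_coverage(pool_blocks, request_blocks):
    """Number of leading request blocks already resident in the pool."""
    covered = 0
    for block in request_blocks:
        if block not in pool_blocks:
            break  # a prefix hit must be contiguous from the first block
        covered += 1
    return covered

def route_by_coverage(pools, request_blocks, load=None):
    """Return the index of the pool with the longest cached prefix.

    Ties are broken in favour of the pool with the lowest load
    (assumption: load is a per-pool count of queued requests).
    """
    load = load or [0] * len(pools)
    return max(
        range(len(pools)),
        key=lambda i: (prefix_coverage(pools[i], request_blocks), -load[i]),
    )

# Usage: pool 1 holds three leading blocks of the request, pool 0 only two.
pools = [{"a", "b"}, {"a", "b", "c"}]
chosen = route_by_coverage(pools, ["a", "b", "c", "d"])
```

Breaking ties by load keeps requests with no cached prefix from piling onto a single pool.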
3. Class-Based Routing Can Be Catastrophic
Routing all shared-prefix requests to a single "hot" pool:
- p99 TTFT: 2278ms (vs 964ms for round-robin)
- Throughput drops 2% from overload on the hot pool
Implication: Intelligent routing requires per-request prefix analysis, not coarse class-based heuristics.
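The overload effect can be seen in a toy sketch that counts how many requests each pool receives under class-based versus round-robin routing. The workload mix (8 of 10 requests sharing a prefix) and the function names are hypothetical, chosen only to make the skew visible. They are not taken from the experiment above.

```python
from collections import Counter
from itertools import cycle

def class_based_loads(requests, hot_pool=0, cold_pool=1):
    """requests: iterable of booleans, True when the request has a shared prefix."""
    return Counter(hot_pool if shared else cold_pool for shared in requests)

def round_robin_loads(requests, num_pools=2):
    """Assign requests to pools in strict rotation, ignoring their class."""
    pools = cycle(range(num_pools))
    return Counter(next(pools) for _ in requests)

# Hypothetical mix: 8 shared-prefix requests followed by 2 unique ones.
workload = [True] * 8 + [False] * 2
skewed = class_based_loads(workload)
balanced = round_robin_loads(workload)
```

With this mix the hot pool receives four times the cold pool's requests, while round-robin splits them evenly.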
Project Structure
apac/
├── schemas/ # Event schemas (trace format, controller decisions)
├── collectors/ # Live trace collectors (vLLM KV events, metrics)
├── simulator/ # Discrete-event simulator for KV cache policies
│   ├── core/ # DES engine, block pool, scheduler
│   ├── workloads/ # Workload generators (BurstGPT, ShareGPT, synthetic)
│   └── replay/ # Trace replay engine
├── controller/ # APAC controller (policy engine, forecasters, routing)
├── oracle/ # Hindsight-optimal oracle for gap analysis
├── visualiser/ # Block heatmaps, timeline plots, residency histograms
├── experiments/ # Experiment configs and scripts
└── tests/ # Unit and integration tests
Quick Start
pip install -e .
# Run oracle gap analysis on BurstGPT traces
python -m apac.experiments.oracle_gap --workload burstgpt --block-size 16 --num-blocks 2048
Key Concepts
- Trace Collector: Subscribes to vLLM KV events via ZMQ and reconstructs block-level state
- Simulator: Replays request traces through a paged block allocator with configurable policies
- Oracle: Computes Belady-optimal (hindsight) decisions for routing, retention, and offload
- Controller: Online APAC that makes decisions based on observable signals only
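To make the block-level view concrete, here is a minimal sketch of how a prompt might be split into fixed-size blocks with chained hashes, so that two requests sharing a prefix produce identical leading block ids. The chaining scheme and the `block_hashes` helper are assumptions for illustration and may differ from what vLLM or the APAC simulator actually does. The default block size of 16 matches the `--block-size 16` flag in the Quick Start.

```python
import hashlib

def block_hashes(token_ids, block_size=16):
    """Chain-hash each full block of tokens.

    Each block's hash depends on its own tokens and on the hash of the
    preceding block, so equal prefixes yield equal hash sequences.
    A trailing partial block is ignored in this sketch.
    """
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        chunk = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(tuple(chunk)).encode()).digest()
        hashes.append(digest.hex())
        parent = digest
    return hashes

# Two prompts that share their first 32 tokens share their first two blocks.
a = block_hashes(list(range(40)))
b = block_hashes(list(range(32)) + [99] * 8)
```

Block ids of this kind are what a pool's resident set and a request's prefix would be compared on when measuring cache coverage.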
Datasets Used
- BurstGPT: 5.29M rows, Azure OpenAI production traces
- WildChat-1M: 1M timestamped conversations
- ShareGPT52K: length distribution reference
References
- PagedAttention (vLLM): arxiv:2309.06180
- BurstGPT: arxiv:2401.17644
- Splitwise: arxiv:2311.18677
- Ada-KV: arxiv:2407.11550
- LMCache: arxiv:2510.09665
- SGLang RadixAttention: arxiv:2312.07104