APAC: Adaptive Paged Attention Controller for LLM Serving

A runtime control plane that sits above paged KV memory and continuously adjusts allocation, retention, offload, and routing policy according to observed request patterns.

Key Findings (Latest Results)

1. LRU Eviction is Near-Optimal for Single-Pool Prefix Sharing

Running Bélády (hindsight-optimal) eviction against LRU under memory pressure:

  • Oracle gap: 0% across throughput, TTFT, and cache hit rate
  • Implication: APAC's value cannot come from better eviction alone
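The oracle-gap comparison above can be illustrated with a minimal sketch that counts block misses only. It is not the project's oracle: the function names (`lru_misses`, `belady_misses`, `oracle_gap`) and the flat list-of-block-ids trace format are assumptions made for illustration, and the reported results also cover throughput and TTFT, which this sketch does not model.

```python
from collections import OrderedDict


def lru_misses(trace, capacity):
    """Count misses for an LRU cache holding at most `capacity` blocks."""
    cache = OrderedDict()  # block id -> None, least recently used first
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)  # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[block] = None
    return misses


def belady_misses(trace, capacity):
    """Count misses when evicting the block whose next use is farthest away."""
    # next_use[i] is the index of the next access to trace[i], or infinity.
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # block id -> index of its next use
    misses = 0
    for i, block in enumerate(trace):
        if block not in cache:
            misses += 1
            if len(cache) >= capacity:
                victim = max(cache, key=cache.get)  # farthest next use
                del cache[victim]
        cache[block] = next_use[i]
    return misses


def oracle_gap(trace, capacity):
    """Fraction of LRU misses that hindsight-optimal eviction would avoid."""
    lru = lru_misses(trace, capacity)
    opt = belady_misses(trace, capacity)
    return (lru - opt) / lru if lru else 0.0
```

On a trace where a few hot prefix blocks are reused shortly after each access, such as `["a", "b", "c", "a", "b", "d", "a", "b"]` with capacity 3, both policies miss the same blocks and the gap is zero. A cyclic scan such as `[1, 2, 3, 1, 2, 3]` with capacity 2 is the classic case where LRU falls behind.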

2. Multi-Pool Routing Offers Dramatic Latency Improvements

Comparing routing policies under memory pressure (2 pools, 128 blocks each, 500 requests):

| Policy       | Throughput (rps) | p99 TTFT (ms) | Cache hits |
|--------------|------------------|---------------|------------|
| Oracle       | 89.7             | 76            | 11,468     |
| Round-robin  | 87.5             | 964           | 61,226     |
| Random       | 85.0             | 1097          | 67,687     |
| Least-loaded | 88.1             | 1173          | 38,525     |

Key result: Oracle routing reduces p99 TTFT by 92% relative to round-robin (964 ms → 76 ms)

The oracle routes each request to the pool with the best prefix cache coverage for that request's prefix. This maximizes cache reuse per-pool and dramatically reduces queueing delays.
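The routing rule described above might be sketched as follows. This is an illustration under stated assumptions, not the project's oracle: `prefix_coverage` and `route_by_coverage` are hypothetical names, each pool is modelled as a plain set of resident block hashes, and the tie-break on lower load is an assumption that the description above does not specify.

```python
def prefix_coverage(request_blocks, pool_cache):
    """Number of leading request blocks already resident in the pool.

    Coverage stops at the first missing block, because a cached prefix is
    only reusable up to the first block that has to be recomputed.
    """
    covered = 0
    for block_hash in request_blocks:
        if block_hash not in pool_cache:
            break
        covered += 1
    return covered


def route_by_coverage(request_blocks, pools, load):
    """Return the index of the pool with the longest cached prefix.

    Ties are broken in favour of the pool with the lower current load
    (an assumption made for this sketch).
    """
    return max(
        range(len(pools)),
        key=lambda i: (prefix_coverage(request_blocks, pools[i]), -load[i]),
    )
```

For example, with `pools = [{"s0", "s1"}, {"s0"}]` and `load = [5, 0]`, a request with blocks `["s0", "s1", "u9"]` goes to pool 0 despite its higher load, while a request with no cached prefix falls through to the less loaded pool 1.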

3. Class-Based Routing Can Be Catastrophic

Routing all shared-prefix requests to a single "hot" pool:

  • p99 TTFT: 2278ms (vs 964ms for round-robin)
  • Throughput drops 2% from overload on the hot pool

Implication: Intelligent routing requires per-request prefix analysis, not coarse class-based heuristics.

Project Structure

apac/
├── schemas/           # Event schemas (trace format, controller decisions)
├── collectors/        # Live trace collectors (vLLM KV events, metrics)
├── simulator/         # Discrete-event simulator for KV cache policies
│   ├── core/          # DES engine, block pool, scheduler
│   ├── workloads/     # Workload generators (BurstGPT, ShareGPT, synthetic)
│   └── replay/        # Trace replay engine
├── controller/        # APAC controller (policy engine, forecasters, routing)
├── oracle/            # Hindsight-optimal oracle for gap analysis
├── visualiser/        # Block heatmaps, timeline plots, residency histograms
├── experiments/       # Experiment configs and scripts
└── tests/             # Unit and integration tests

Quick Start

pip install -e .
# Run oracle gap analysis on BurstGPT traces
python -m apac.experiments.oracle_gap --workload burstgpt --block-size 16 --num-blocks 2048

Key Concepts

  • Trace Collector: Subscribes to vLLM KV events via ZMQ and reconstructs block-level state
  • Simulator: Replays request traces through a paged block allocator with configurable policies
  • Oracle: Computes Belady-optimal (hindsight) decisions for routing, retention, and offload
  • Controller: Online APAC that makes decisions based on observable signals only
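To make the simulator concept concrete, here is a toy paged block pool with prefix-chained block hashes and LRU eviction. It is a sketch under assumptions, not the code in `simulator/core/`: the `BlockPool` class, its method names, and the hashing scheme are all hypothetical, and only the default block size of 16 is taken from the Quick Start command.

```python
from collections import OrderedDict


class BlockPool:
    """Toy paged KV block pool: fixed capacity, hash lookup, LRU eviction."""

    def __init__(self, num_blocks, block_size=16):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.resident = OrderedDict()  # block hash -> None, oldest first
        self.hits = 0
        self.misses = 0

    def block_hashes(self, token_ids):
        """Hash each full block chained with the hash of the blocks before it.

        Chaining means two blocks match only if their entire prefix matches.
        A trailing partial block is ignored.
        """
        hashes, parent = [], None
        full = len(token_ids) - len(token_ids) % self.block_size
        for start in range(0, full, self.block_size):
            chunk = tuple(token_ids[start:start + self.block_size])
            parent = hash((parent, chunk))
            hashes.append(parent)
        return hashes

    def admit(self, token_ids):
        """Touch or allocate every full block of a request.

        Returns the number of blocks that were already resident.
        """
        reused = 0
        for h in self.block_hashes(token_ids):
            if h in self.resident:
                self.resident.move_to_end(h)  # refresh recency
                self.hits += 1
                reused += 1
            else:
                self.misses += 1
                if len(self.resident) >= self.num_blocks:
                    self.resident.popitem(last=False)  # evict LRU block
                self.resident[h] = None
        return reused
```

A configurable policy would replace the `popitem(last=False)` call with a pluggable victim selector. The sketch hard-codes LRU to stay short.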

Datasets Used

  • BurstGPT: 5.29M rows, Azure OpenAI production traces
  • WildChat-1M: 1M timestamped conversations
  • ShareGPT52K: Length distribution reference
