Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference
Abstract
Heterogeneous GPU-FPGA systems can accelerate large language model inference by offloading memory-intensive operations to FPGAs, achieving significant performance and energy improvements.
Modern large language models (LLMs) increasingly depend on efficient long-context processing and generation mechanisms, including sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory, to support complex reasoning. We show that these optimizations can be unified into a four-step memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference. Through systematic profiling, we identify a 22%–97% memory processing overhead in LLM inference and strong heterogeneity in its computational characteristics. Motivated by this insight, we argue that heterogeneous systems are well suited to accelerate memory processing and thus end-to-end inference. We demonstrate this approach on a GPU-FPGA system by offloading sparse, irregular, and memory-bound operations to FPGAs while retaining compute-intensive operations on GPUs. Evaluated on an AMD MI210 GPU and an Alveo U55C FPGA, our system is 1.04–2.2× faster and requires 1.11–4.7× less energy than the GPU baseline across multiple LLM inference optimizations (similar results hold on an NVIDIA A100). These results establish heterogeneous systems as a practical direction for efficient LLM memory processing and inform future heterogeneous hardware design.
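The four-step pipeline named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the hash-bucket "embedding", cosine scoring, and function names are all illustrative stand-ins for the real encoders, relevancy kernels, and KV-cache structures an actual system would use. The top-k Retrieval step is the kind of sparse, irregular operation the paper offloads to the FPGA.

```python
import math

def _bucket(token, dim):
    # Deterministic toy "embedding" bucket; a stand-in for a real encoder.
    return sum(ord(c) for c in token) % dim

def prepare_memory(chunks, dim=8):
    # Step 1 (Prepare Memory): encode each context chunk into a fixed-size
    # unit vector. Real systems would use learned embeddings or compressed
    # KV-cache entries instead of this bag-of-words toy.
    memory = []
    for chunk in chunks:
        vec = [0.0] * dim
        for token in chunk.split():
            vec[_bucket(token, dim)] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        memory.append([v / norm for v in vec])
    return memory

def compute_relevancy(query_vec, memory):
    # Step 2 (Compute Relevancy): score every memory entry against the
    # query (cosine similarity, since step 1 produced unit vectors).
    return [sum(q * m for q, m in zip(query_vec, entry)) for entry in memory]

def retrieve(scores, k=2):
    # Step 3 (Retrieval): select the indices of the top-k entries. This
    # sparse, data-dependent selection is memory-bound rather than
    # compute-bound, which is why it suits an FPGA offload.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def apply_to_inference(chunks, top_idx):
    # Step 4 (Apply to Inference): feed only the retrieved chunks forward,
    # here modeled as concatenating them into a reduced context.
    return " ".join(chunks[i] for i in sorted(top_idx))

chunks = ["the cat sat", "gpu kernels fuse", "fpga offload wins", "retrieval augmented"]
memory = prepare_memory(chunks)
query = prepare_memory(["fpga offload"])[0]
context = apply_to_inference(chunks, retrieve(compute_relevancy(query, memory), k=2))
```

In a heterogeneous deployment, steps 1 and 4 (dense, compute-intensive) would stay on the GPU while steps 2 and 3 (sparse, memory-bound) are candidates for the FPGA.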
Community
Code will be released shortly.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- FAST-Prefill: FPGA Accelerated Sparse Attention for Long Context LLM Prefill (2026)
- PackInfer: Compute- and I/O-Efficient Attention for Batched LLM Inference (2026)
- Deep Kernel Fusion for Transformers (2026)
- SpecAttn: Co-Designing Sparse Attention with Self-Speculative Decoding (2026)
- CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference (2026)
- Pooling Engram Conditional Memory in Large Language Models using CXL (2026)
- SkipOPU: An FPGA-based Overlay Processor for Large Language Models with Dynamically Allocated Computation (2026)