A curated list that categorizes existing work on Large Language Model (LLM) serving. Star this repository to follow updates, draw inspiration, and contribute to the advancement of this research field.
- Orca: A Distributed Serving System for Transformer-Based Generative Models | OSDI' 22
- SpotServe: Serving Generative Large Language Models on Preemptible Instances | CMU
- ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference | ASPLOS' 24
- SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads
- Fairness in Serving Large Language Models | OSDI' 24
- Llumnix: Dynamic Scheduling for Large Language Model Serving | OSDI' 24
- Efficient LLM Scheduling by Learning to Rank | UCSD
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | SOSP' 23
- vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
- DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
- Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving | Moonshot
- MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool | Huawei
- FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU | ICML' 23
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory | Apple
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone | SJTU
- LLMCad: Fast and Scalable On-device Large Language Model Inference
- CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
- STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining | ASPLOS' 23
- NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing | ASPLOS' 24
- AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference | ASPLOS' 24
- IANUS: Integrated Accelerator based on NPU-PIM Unified Memory System | ASPLOS' 24
- Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs | CMU
- Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity | UCB
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | VLDB' 24
- Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML' 23
- MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention | Microsoft
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | MIT
- LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference | Apple
- SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification | CMU
- Optimizing Speculative Decoding for Serving Large Language Models Using Goodput | UCB
- TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding | CMU
- MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding | CMU
- MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving
- BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
- Punica: Multi-Tenant LoRA Serving
- SGLang: Efficient Execution of Structured Language Model Programs | UCB
- AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation | Microsoft
- Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention | NUS
- Efficiently Scaling Transformer Inference | MLSys' 23
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills | Microsoft
- Tabi: An Efficient Multi-Level Inference System for Large Language Models | EuroSys' 23
- FlashDecoding++: Faster Large Language Model Inference on GPUs | Tsinghua
- DeepSpeed-MII: Model Implementations for Inference (MII) | Microsoft
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
- APIServe: Efficient API Support for Large-Language Model Inferencing
- FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning
- AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving | NUS
- LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism | PKU
- Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services | Umich
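
Several entries above (vLLM/PagedAttention, vAttention, and the KVCache-centric disaggregated designs) revolve around how the KV cache is laid out and allocated. Below is a minimal, illustrative sketch of the block-table bookkeeping idea behind paged KV-cache management; it is not vLLM's actual API, and every class, constant, and method name here is hypothetical.

```python
# Minimal, illustrative sketch of paged KV-cache bookkeeping: the KV cache is
# carved into fixed-size blocks, and each sequence keeps a block table mapping
# its logical cache positions to physical blocks. Names are hypothetical and
# not taken from vLLM's code base.
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value)


@dataclass
class SequenceCache:
    """Per-sequence view of the paged KV cache."""
    block_table: list[int] = field(default_factory=list)  # logical block -> physical block
    num_tokens: int = 0


class PagedKVCacheManager:
    """Allocates fixed-size KV blocks from a shared pool on demand."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))

    def append_token(self, seq: SequenceCache) -> int:
        """Reserve cache space for one new token; returns the physical block used."""
        if seq.num_tokens % BLOCK_SIZE == 0:  # current block is full (or none allocated yet)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; caller should preempt or swap")
            seq.block_table.append(self.free_blocks.pop())
        seq.num_tokens += 1
        return seq.block_table[-1]

    def release(self, seq: SequenceCache) -> None:
        """Return all of a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(seq.block_table)
        seq.block_table.clear()
        seq.num_tokens = 0


if __name__ == "__main__":
    manager = PagedKVCacheManager(num_blocks=4)
    seq = SequenceCache()
    for _ in range(40):                 # decode 40 tokens -> ceil(40 / 16) = 3 blocks
        manager.append_token(seq)
    print(seq.block_table, len(manager.free_blocks))  # three blocks allocated, one left free
    manager.release(seq)
```

The point of the design is that a sequence grows its cache one fixed-size block at a time from a shared pool, so memory is neither over-reserved for the maximum context length nor fragmented across requests.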
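
Likewise, the speculative-decoding entries (SpecInfer, TriForce, MagicDec, and the goodput-oriented scheduler above) share a draft-then-verify loop. The sketch below shows the simple greedy-acceptance variant of that loop with toy stand-in "models"; all names and signatures are hypothetical and not taken from any of the systems listed.

```python
# Minimal sketch of the draft-then-verify loop behind speculative decoding.
# A small draft model proposes k tokens; the large target model checks them
# and accepts the longest agreeing prefix, plus one guaranteed target token.
from typing import Callable, List

Token = int
Model = Callable[[List[Token]], Token]  # returns the next token for a context


def speculative_step(target: Model, draft: Model,
                     context: List[Token], k: int = 4) -> List[Token]:
    """Run one draft-and-verify round and return the accepted new tokens."""
    # 1) Draft: cheaply propose k tokens autoregressively with the small model.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2) Verify: the target model checks each proposed position; a real system
    #    scores all k positions in a single batched forward pass.
    accepted: List[Token] = []
    for tok in proposed:
        if target(context + accepted) == tok:
            accepted.append(tok)          # draft and target agree: keep it
        else:
            break                         # first disagreement ends the round
    # 3) Always emit one token from the target so progress is guaranteed.
    accepted.append(target(context + accepted))
    return accepted


if __name__ == "__main__":
    # Toy models: the target counts up by 1; the draft is wrong on some steps.
    target = lambda ctx: ctx[-1] + 1
    draft = lambda ctx: ctx[-1] + (2 if len(ctx) % 3 == 0 else 1)
    out = [0]
    while len(out) < 12:
        out += speculative_step(target, draft, out)
    print(out)  # monotonically increasing sequence, produced in few target rounds
```

A real serving system folds the k verification positions into one target-model forward pass, which is where the latency win over plain autoregressive decoding comes from.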