A curated list that categorizes existing work on Large Language Model (LLM) serving. Star this repository to follow updates, draw inspiration, and contribute to the advancement of this research field.
- Orca: A Distributed Serving System for Transformer-Based Generative Models | OSDI' 22
- SpotServe: Serving Generative Large Language Models on Preemptible Instances | CMU
- ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference | ASPLOS' 24
- SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads
- Fairness in Serving Large Language Models | OSDI' 24
- Llumnix: Dynamic Scheduling for Large Language Model Serving | OSDI' 24
- Efficient LLM Scheduling by Learning to Rank | UCSD
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | SOSP' 23
- vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
- DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
- Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving | Moonshot
- MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool | Huawei
- FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU | ICML' 23
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory | Apple
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone | SJTU
- LLMCad: Fast and Scalable On-device Large Language Model Inference
- CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
- STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining | ASPLOS' 23
- NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing | ASPLOS' 24
- AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference | ASPLOS' 24
- IANUS: Integrated Accelerator based on NPU-PIM Unified Memory System | ASPLOS' 24
- Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs | CMU
- Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity | UCB
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | VLDB' 24
- Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML' 23
- MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention | Microsoft
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | MIT
- LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference | Apple
- SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification | CMU
- Optimizing Speculative Decoding for Serving Large Language Models Using Goodput | UCB
- TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding | CMU
- MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding | CMU
- MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving
- BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
- Punica: Multi-Tenant LoRA Serving
- SGLang: Efficient Execution of Structured Language Model Programs | UCB
- AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation | Microsoft
- Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention | NUS
- Efficiently Scaling Transformer Inference | MLSys' 23
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills | Microsoft
- Tabi: An Efficient Multi-Level Inference System for Large Language Models | EuroSys' 23
- FlashDecoding++: Faster Large Language Model Inference on GPUs | Tsinghua
- DeepSpeed-MII: Model Implementations for Inference (MII) | Microsoft
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
- APIServe: Efficient API Support for Large-Language Model Inferencing
- FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning
- AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving | NUS
- LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism | PKU
- Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services | Umich
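
Several entries above (vLLM/PagedAttention, vAttention, and the KVCache-centric disaggregated designs) revolve around how the KV cache is laid out and allocated. Below is a minimal, illustrative sketch of the block-table bookkeeping idea behind paged KV-cache management; it is not vLLM's actual API, and every class, constant, and method name here is hypothetical.

```python
# Minimal, illustrative sketch of paged KV-cache bookkeeping: the KV cache is
# carved into fixed-size blocks, and each sequence keeps a block table mapping
# its logical cache positions to physical blocks. Names are hypothetical and
# not taken from vLLM's code base.
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value)


@dataclass
class SequenceCache:
    """Per-sequence view of the paged KV cache."""
    block_table: list[int] = field(default_factory=list)  # logical block -> physical block
    num_tokens: int = 0


class PagedKVCacheManager:
    """Allocates fixed-size KV blocks from a shared pool on demand."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))

    def append_token(self, seq: SequenceCache) -> int:
        """Reserve cache space for one new token; returns the physical block used."""
        if seq.num_tokens % BLOCK_SIZE == 0:  # current block is full (or none allocated yet)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; caller should preempt or swap")
            seq.block_table.append(self.free_blocks.pop())
        seq.num_tokens += 1
        return seq.block_table[-1]

    def release(self, seq: SequenceCache) -> None:
        """Return all of a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(seq.block_table)
        seq.block_table.clear()
        seq.num_tokens = 0


if __name__ == "__main__":
    manager = PagedKVCacheManager(num_blocks=4)
    seq = SequenceCache()
    for _ in range(40):                 # decode 40 tokens -> ceil(40 / 16) = 3 blocks
        manager.append_token(seq)
    print(seq.block_table, len(manager.free_blocks))  # three blocks allocated, one left free
    manager.release(seq)
```

The point of the design is that a sequence grows its cache one fixed-size block at a time from a shared pool, so memory is neither over-reserved for the maximum context length nor fragmented across requests.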
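
Likewise, the speculative-decoding entries (SpecInfer, TriForce, MagicDec, and the goodput-oriented scheduler above) share a draft-then-verify loop. The sketch below shows the simple greedy-acceptance variant of that loop with toy stand-in "models"; all names and signatures are hypothetical and not taken from any of the systems listed.

```python
# Minimal sketch of the draft-then-verify loop behind speculative decoding.
# A small draft model proposes k tokens; the large target model checks them
# and accepts the longest agreeing prefix, plus one guaranteed target token.
from typing import Callable, List

Token = int
Model = Callable[[List[Token]], Token]  # returns the next token for a context


def speculative_step(target: Model, draft: Model,
                     context: List[Token], k: int = 4) -> List[Token]:
    """Run one draft-and-verify round and return the accepted new tokens."""
    # 1) Draft: cheaply propose k tokens autoregressively with the small model.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2) Verify: the target model checks each proposed position; a real system
    #    scores all k positions in a single batched forward pass.
    accepted: List[Token] = []
    for tok in proposed:
        if target(context + accepted) == tok:
            accepted.append(tok)          # draft and target agree: keep it
        else:
            break                         # first disagreement ends the round
    # 3) Always emit one token from the target so progress is guaranteed.
    accepted.append(target(context + accepted))
    return accepted


if __name__ == "__main__":
    # Toy models: the target counts up by 1; the draft is wrong on some steps.
    target = lambda ctx: ctx[-1] + 1
    draft = lambda ctx: ctx[-1] + (2 if len(ctx) % 3 == 0 else 1)
    out = [0]
    while len(out) < 12:
        out += speculative_step(target, draft, out)
    print(out)  # monotonically increasing sequence, produced in few target rounds
```

A real serving system folds the k verification positions into one target-model forward pass, which is where the latency win over plain autoregressive decoding comes from.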