Awesome LLM Serving Papers

A curated list that categorizes existing work on Large Language Model (LLM) serving. Star this repository for inspiration, and consider contributing to the advancement of this research field.

LLM Serving

Scheduling

Memory/KVCache

Disaggregated

  • DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
  • Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
  • Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving | Moonshot
  • MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool | Huawei
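
The papers above share one pattern: run the compute-bound prefill phase and the memory-bound decode phase on separate worker pools, shipping the KV cache between them. Below is a minimal, hypothetical in-process sketch of that split; real systems place the two stages on different GPU pools and transfer the cache over the network or RDMA, and all class and function names here are illustrative rather than taken from any of the listed systems.

```python
# Toy sketch of prefill/decode disaggregation. PrefillWorker and DecodeWorker are
# hypothetical stand-ins; they do no real model computation.
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Stand-in for the per-layer key/value tensors produced during prefill."""
    prompt_tokens: list
    entries: list = field(default_factory=list)


class PrefillWorker:
    """Runs the compute-bound prompt pass once and emits a KV cache."""
    def prefill(self, prompt_tokens):
        cache = KVCache(prompt_tokens=list(prompt_tokens))
        cache.entries = [f"kv({tok})" for tok in prompt_tokens]  # one entry per prompt token
        return cache


class DecodeWorker:
    """Runs the memory-bound autoregressive loop against a transferred cache."""
    def decode(self, cache, max_new_tokens):
        generated = []
        for step in range(max_new_tokens):
            tok = f"tok{step}"                  # a real decoder would attend over cache.entries
            cache.entries.append(f"kv({tok})")  # the cache grows by one entry per decode step
            generated.append(tok)
        return generated


if __name__ == "__main__":
    cache = PrefillWorker().prefill(["The", "quick", "brown", "fox"])  # prefill pool
    print(DecodeWorker().decode(cache, max_new_tokens=3))              # decode pool
```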

Resource-Constrained

Heterogeneous

  • NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing | ASPLOS'24
  • AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference | ASPLOS'24
  • IANUS: Integrated Accelerator based on NPU-PIM Unified Memory System | ASPLOS'24
  • Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs | CMU
  • Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity | UCB

Sparsity

  • Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | VLDB'24
  • Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time | ICML'23
  • MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention | Microsoft
  • QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | MIT
  • LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference | Apple
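
Several of the entries above (e.g. Deja Vu, MInference, LazyLLM) exploit the observation that only a small subset of cached tokens matters for a given query. The numpy sketch below illustrates the general idea with a simple top-k rule over attention scores; the selection heuristic and shapes are illustrative only and do not reproduce any specific paper's method.

```python
# A minimal top-k sparse attention sketch: score all cached keys, keep only the k
# largest for the softmax, and skip the rest of the KV cache.
import numpy as np


def sparse_attention(q, K, V, k=4):
    """Attend with a single (d,) query over (seq_len, d) keys/values, top-k only."""
    scores = K @ q / np.sqrt(q.shape[0])     # (seq_len,) attention logits
    keep = np.argsort(scores)[-k:]           # indices of the k highest-scoring tokens
    kept = scores[keep]
    weights = np.exp(kept - kept.max())
    weights /= weights.sum()                 # softmax over the kept subset only
    return weights @ V[keep]                 # (d,) output


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, d = 1024, 64
    q = rng.standard_normal(d)
    K = rng.standard_normal((seq_len, d))
    V = rng.standard_normal((seq_len, d))
    print(sparse_attention(q, K, V, k=32).shape)  # (64,)
```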

Speculative Decoding
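
This category covers draft-then-verify decoding: a cheap draft model proposes several tokens and the large target model accepts the longest agreeing prefix, so one target pass can emit multiple tokens. A toy, greedy (argmax-match) variant is sketched below; both "models" are made-up callables, and real methods use a probabilistic acceptance rule and batched verification.

```python
# Greedy draft-then-verify speculative decoding in miniature. draft_model and
# target_model are hypothetical next-token callables, not real LLM interfaces.
def speculative_step(prefix, draft_model, target_model, k=4):
    """Propose k draft tokens, keep the longest prefix the target agrees with."""
    draft, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_model(ctx)            # cheap proposal
        draft.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(prefix)
    for tok in draft:
        if target_model(ctx) == tok:      # target agrees -> accept and continue
            accepted.append(tok)
            ctx.append(tok)
        else:                             # first disagreement -> take the target's token
            accepted.append(target_model(ctx))
            break
    return accepted


if __name__ == "__main__":
    # Toy "models": next token is the context length modulo a small vocabulary.
    draft = lambda ctx: len(ctx) % 5
    target = lambda ctx: len(ctx) % 5 if len(ctx) < 6 else 0
    print(speculative_step([1, 2, 3], draft, target, k=4))  # [3, 4, 0, 0]
```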

Multiple LLM

  • MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving
  • BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
  • Punica: Multi-Tenant LoRA Serving
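
Punica above and S-LoRA (listed under Misc) serve many LoRA adapters over a single shared base model: the base GEMM is batched across tenants while each request adds its own low-rank correction. A toy numpy sketch of that pattern follows; the tenant names, shapes, and per-row loop are illustrative assumptions, not the papers' actual gather kernels.

```python
# Multi-tenant LoRA sketch: one shared base weight, per-request low-rank adapters.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 16, 4

W_base = rng.standard_normal((d_in, d_out))   # shared across all tenants

# Per-tenant adapters kept factorized (A, B) so W_eff = W_base + A @ B stays cheap.
adapters = {
    "tenant_a": (rng.standard_normal((d_in, rank)), rng.standard_normal((rank, d_out))),
    "tenant_b": (rng.standard_normal((d_in, rank)), rng.standard_normal((rank, d_out))),
}


def batched_lora_forward(x_batch, tenant_ids):
    """One base GEMM for the whole batch, then a per-row low-rank correction."""
    base = x_batch @ W_base                    # (batch, d_out), work shared by all tenants
    out = np.empty_like(base)
    for i, tenant in enumerate(tenant_ids):    # adapter gathered per request
        A, B = adapters[tenant]
        out[i] = base[i] + x_batch[i] @ A @ B
    return out


if __name__ == "__main__":
    x = rng.standard_normal((3, d_in))
    print(batched_lora_forward(x, ["tenant_a", "tenant_b", "tenant_a"]).shape)  # (3, 16)
```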

LLM Program/Agent

Misc

  • Efficiently Scaling Transformer Inference | MLSys'23
  • DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
  • SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills | Microsoft (see the sketch after this list)
  • Tabi: An Efficient Multi-Level Inference System for Large Language Models | EuroSys'23
  • FlashDecoding++: Faster Large Language Model Inference on GPUs | Tsinghua
  • DeepSpeed-MII: Model Implementations for Inference (MII) | Microsoft
  • S-LoRA: Serving Thousands of Concurrent LoRA Adapters
  • Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
  • APIServe: Efficient API Support for Large-Language Model Inferencing
  • FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning
  • AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving | NUS
  • LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism | PKU
  • Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services | UMich
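
As referenced next to the SARATHI entry above, chunked prefill splits a long prompt into fixed-size chunks and packs ongoing decode requests into each chunk's batch, so decode steps piggyback on prefill compute. The scheduling-level sketch below is a simplified illustration; the chunk size, request shapes, and function names are assumptions, not SARATHI's implementation.

```python
# Toy illustration of chunked prefill with piggybacked decodes. build_batches is a
# hypothetical scheduler helper; real systems enforce a per-iteration token budget.
def build_batches(prefill_tokens, decode_requests, chunk_size=4):
    """Pair each prefill chunk with one decode step from every ongoing request."""
    batches = []
    for start in range(0, len(prefill_tokens), chunk_size):
        chunk = prefill_tokens[start:start + chunk_size]
        batches.append({"prefill_chunk": chunk,
                        "decode_steps": list(decode_requests)})
    return batches


if __name__ == "__main__":
    prompt = list(range(10))          # 10-token prompt for a newly admitted request
    ongoing = ["req-7", "req-9"]      # requests already in their decode phase
    for i, batch in enumerate(build_batches(prompt, ongoing)):
        print(i, batch)
```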

Other list
