Getting Started
Note: ⚡ = must-read fundamentals, 💎 = optional fundamentals, 💡 = advanced reading
Take the online course and complete the assignments to learn the basic building blocks of a deep learning system. Don't stop at merely finishing the assignments: pay attention to runtime efficiency, e.g. benchmark your implementation against PyTorch to see whether it is faster or slower (a minimal benchmarking sketch follows the course list).
🎓 CMU DL Systems course ⚡
🎓 System for AI 💎
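A minimal sketch of the kind of comparison meant above, assuming you swap the NumPy matmul baseline for the corresponding op from your own course framework (e.g. your needle implementation); the matrix size and iteration counts are illustrative.

```python
# Hedged benchmarking sketch: replace np.matmul with the op from your own framework.
import time
import numpy as np
import torch

def bench(fn, *args, warmup=3, iters=20):
    """Average wall-clock time of fn(*args) over `iters` runs, after warmup."""
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

n = 1024
a = np.random.randn(n, n).astype(np.float32)
b = np.random.randn(n, n).astype(np.float32)
a_t, b_t = torch.from_numpy(a), torch.from_numpy(b)

mine = bench(np.matmul, a, b)        # stand-in for your own implementation
ref = bench(torch.matmul, a_t, b_t)  # PyTorch reference
print(f"yours: {mine*1e3:.2f} ms  torch: {ref*1e3:.2f} ms  ratio: {mine/ref:.2f}x")
```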
There are plenty of blog posts introducing distributed training; search online and skim some of them first to build up basic background, after which the two survey papers will be much easier to follow.
📄 GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding 💎
📄 A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training 💎
📄 FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-trained Models 💎
📄 FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement 💎
📄 (Megatron-style SP) Reducing Activation Recomputation in Large Transformer Models 💎
📄 (Ulysses-style SP) DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models 💎
📄 (RSA) Sequence Parallelism: Long Sequence Training from System Perspective 💎
📄 (Ring-Attn) Ring Attention with Blockwise Transformers for Near-Infinite Context 💎
📄 (LightSeq/DistFlashAttn) DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training 💎
📄 QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding 💎
📄 Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training 💎
📄 Don't Waste Your Bits! Squeeze Activations and Gradients for Deep Neural Networks via TinyScript 💎
📄 PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization 💎
📄 Striped Attention: Faster Ring Attention for Causal Transformers 💡
📄 USP: A Unified Sequence Parallelism Approach for Long Context Generative AI 💡
📄 LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism 💡
📄 Efficiently Training 7B LLM with 1 Million Sequence Length on 8 GPUs 💡
📄 HetHub: A Heterogeneous Distributed Hybrid Training System for Large-scale Models 💡
📄 AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness 💡
📄 AccPar: Tensor Partitioning for Heterogeneous Deep Learning Accelerators 💡
📄 Whale: Scaling Deep Learning Model Training to the Trillions 💡
🛠️ vLLM ⚡
Distributed inference also makes heavy use of parallelism strategies, so it is worth reading the earlier material on parallelism as well (a minimal vLLM serving sketch follows this list).
📄 DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving 💎
📄 Splitwise: Efficient Generative LLM Inference Using Phase Splitting 💎
📄 Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads 💎
📄 Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving 💎
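A minimal offline-inference sketch with vLLM, assuming vLLM is installed and two GPUs are available; the model name, prompt, and sampling parameters are illustrative, and exact arguments may vary across vLLM versions.

```python
# Hedged sketch: serve a model with vLLM, sharding its weights across 2 GPUs
# via tensor parallelism (one of the parallelism strategies referenced above).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-6.7b", tensor_parallel_size=2)  # illustrative model choice

sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(["Explain what a KV cache is."], sampling)
for out in outputs:
    print(out.outputs[0].text)
```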