Getting Started
Note: ⚡ = must-read fundamentals, 💎 = optional fundamentals, 💡 = advanced reading
Take the online course and complete the assignments to learn the basic building blocks of a deep learning system. Don't stop at merely finishing the assignments: pay attention to runtime efficiency, e.g. benchmark your implementation against PyTorch to see whether it is faster or slower (a minimal benchmarking sketch follows the course list).
🎓 CMU DL Systems course ⚡
🎓 System for AI 💎
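A minimal sketch of the kind of comparison meant above, assuming you swap the NumPy matmul baseline for the corresponding op from your own course framework (e.g. your needle implementation); the matrix size and iteration counts are illustrative.

```python
# Hedged benchmarking sketch: replace np.matmul with the op from your own framework.
import time
import numpy as np
import torch

def bench(fn, *args, warmup=3, iters=20):
    """Average wall-clock time of fn(*args) over `iters` runs, after warmup."""
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

n = 1024
a = np.random.randn(n, n).astype(np.float32)
b = np.random.randn(n, n).astype(np.float32)
a_t, b_t = torch.from_numpy(a), torch.from_numpy(b)

mine = bench(np.matmul, a, b)        # stand-in for your own implementation
ref = bench(torch.matmul, a_t, b_t)  # PyTorch reference
print(f"yours: {mine*1e3:.2f} ms  torch: {ref*1e3:.2f} ms  ratio: {mine/ref:.2f}x")
```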
There are plenty of blog posts introducing distributed training; search online and skim some of them first to build up basic background, after which the two survey papers will be much easier to follow.
📄 GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding 💎
📄 A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training 💎
📄 FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-trained Models 💎
📄 FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement 💎
📄 (Megatron-style SP) Reducing Activation Recomputation in Large Transformer Models 💎
📄 (Ulysses-style SP) DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models 💎
📄 (RSA) Sequence Parallelism: Long Sequence Training from System Perspective 💎
📄 (Ring-Attn) Ring Attention with Blockwise Transformers for Near-Infinite Context 💎
📄 (LightSeq/DistFlashAttn) DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training 💎
📄 QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding 💎
📄 Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training 💎
📄 Don't Waste Your Bits! Squeeze Activations and Gradients for Deep Neural Networks via TinyScript 💎
📄 PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization 💎
📄 Striped Attention: Faster Ring Attention for Causal Transformers 💡
📄 USP: A Unified Sequence Parallelism Approach for Long Context Generative AI 💡
📄 LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism 💡
📄 Efficiently Training 7B LLM with 1 Million Sequence Length on 8 GPUs 💡
📄 HetHub: A Heterogeneous Distributed Hybrid Training System for Large-scale Models 💡
📄 AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness 💡
📄 AccPar: Tensor Partitioning for Heterogeneous Deep Learning Accelerators 💡
📄 Whale: Scaling Deep Learning Model Training to the Trillions 💡
🛠️ vLLM ⚡
Distributed inference also makes heavy use of parallelism strategies, so it is worth reading the earlier material on parallelism as well (a minimal vLLM serving sketch follows this list).
📄 DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving 💎
📄 Splitwise: Efficient Generative LLM Inference Using Phase Splitting 💎
📄 Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads 💎
📄 Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving 💎
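A minimal offline-inference sketch with vLLM, assuming vLLM is installed and two GPUs are available; the model name, prompt, and sampling parameters are illustrative, and exact arguments may vary across vLLM versions.

```python
# Hedged sketch: serve a model with vLLM, sharding its weights across 2 GPUs
# via tensor parallelism (one of the parallelism strategies referenced above).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-6.7b", tensor_parallel_size=2)  # illustrative model choice

sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(["Explain what a KV cache is."], sampling)
for out in outputs:
    print(out.outputs[0].text)
```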