Releases: zhihu/ZhiLight
Releases · zhihu/ZhiLight
v0.4.8
We first opensource release version💡
🎉🎉 Main Features
- Asynchronous OpenAI compatible interface adapted from vllm
- Custom defined tensor and unified global memory management
- 🔥 Encode and all-reduce overlap, we named "dual streams"
- Host all-reduce based on SIMD instructions
- Optimized fused kernels, qkv, residual & layernorm etc.
- 🔥 Fused batch attention for decoding based on tensor core instructions
- Support TP and PP on one node, TP is recommended
- Support dynamic batch
- Support flashatten prefill
- Support chunked prefill
- Support prefix cache
- Support Native INT8/SmoothQuant/FP8/AWQ/GPTQ quantization
- Support Marlin kernel for GPTQ
- Support MoE, DeepseekV2 MoE and DeepseekV2 MLA
- Support Llama/Llama2, Mixtral, Qwen2 series and similar models
Docker Image
# Docker image
# CUDA: 12.4.1 Driver: 550.54.15及兼容版本
docker pull ghcr.io/zhihu/zhilight/zhilight:0.4.8-cu124
# CUDA: 12.5.1 Driver: 555.42.02及兼容版本
docker pull ghcr.io/zhihu/zhilight/zhilight:0.4.8-cu125
# 以下Dockerfile可供参考,选用官方cuDNN镜像构建自己特定版本的镜像
# Dockerfile: https://github.com/zhihu/ZhiLight/blob/main/docker/Dockerfile