
v0.4.8

@unix1986 unix1986 released this 10 Dec 04:04
· 12 commits to main since this release
f8bdd3b

Our first open-source release! 💡

🎉🎉 Main Features

  • Asynchronous OpenAI-compatible interface adapted from vLLM
  • Custom tensor implementation and unified global memory management
  • 🔥 Overlapped encode and all-reduce, which we call "dual streams"
  • Host-side all-reduce based on SIMD instructions
  • Optimized fused kernels: QKV, residual & LayerNorm, etc.
  • 🔥 Fused batch attention for decoding based on tensor core instructions
  • Support for TP and PP on a single node (TP is recommended)
  • Support for dynamic batching
  • Support for FlashAttention prefill
  • Support for chunked prefill
  • Support for prefix cache
  • Support for native INT8/SmoothQuant/FP8/AWQ/GPTQ quantization
  • Support for the Marlin kernel for GPTQ
  • Support for MoE, including DeepseekV2 MoE and DeepseekV2 MLA
  • Support for Llama/Llama2, Mixtral, Qwen2-series, and similar models
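
Because the server exposes an OpenAI-compatible interface, it can be driven by any standard OpenAI-style HTTP client. Below is a minimal sketch of building a chat-completions request body; the endpoint address and model name are placeholders for illustration, not values taken from this release:

```python
import json

# Hypothetical endpoint of a locally running ZhiLight server; the path
# follows the OpenAI chat-completions convention.
ENDPOINT = "http://localhost:8000/v1/chat/completions"

# Request payload in the OpenAI chat-completions format.
payload = {
    "model": "qwen2-7b-instruct",  # placeholder model name
    "messages": [
        {"role": "user", "content": "Hello!"},
    ],
    "stream": False,
}

# Serialize the payload; it could then be sent with any HTTP client, e.g.
#   curl $ENDPOINT -H "Content-Type: application/json" -d "$body"
body = json.dumps(payload)
print(body)
```

Any client library that speaks the OpenAI API (for example, the official `openai` Python package pointed at this base URL) should work unchanged.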

Docker Image

# Docker image
# CUDA: 12.4.1, Driver: 550.54.15 and compatible versions
docker pull ghcr.io/zhihu/zhilight/zhilight:0.4.8-cu124
# CUDA: 12.5.1, Driver: 555.42.02 and compatible versions
docker pull ghcr.io/zhihu/zhilight/zhilight:0.4.8-cu125
# The Dockerfile below is provided for reference; you can build your own version-specific image from an official cuDNN base image
# Dockerfile: https://github.com/zhihu/ZhiLight/blob/main/docker/Dockerfile