
v0.4.8

@unix1986 unix1986 released this 10 Dec 04:04
· 12 commits to main since this release
f8bdd3b

Our first open-source release! 💡

🎉🎉 Main Features

  • Asynchronous OpenAI-compatible interface adapted from vLLM
  • Custom tensor implementation and unified global memory management
  • 🔥 Overlapped encode and all-reduce, which we call "dual streams"
  • Host-side all-reduce based on SIMD instructions
  • Optimized fused kernels: QKV, residual & LayerNorm, etc.
  • 🔥 Fused batch attention for decoding based on tensor core instructions
  • Support for TP and PP on a single node (TP is recommended)
  • Support for dynamic batching
  • Support for FlashAttention prefill
  • Support for chunked prefill
  • Support for prefix cache
  • Support for native INT8/SmoothQuant/FP8/AWQ/GPTQ quantization
  • Support for the Marlin kernel for GPTQ
  • Support for MoE, including DeepseekV2 MoE and DeepseekV2 MLA
  • Support for Llama/Llama2, Mixtral, Qwen2-series, and similar models
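
Because the server exposes an OpenAI-compatible interface, it can be driven by any standard OpenAI-style HTTP client. Below is a minimal sketch of building a chat-completions request body; the endpoint address and model name are placeholders for illustration, not values taken from this release:

```python
import json

# Hypothetical endpoint of a locally running ZhiLight server; the path
# follows the OpenAI chat-completions convention.
ENDPOINT = "http://localhost:8000/v1/chat/completions"

# Request payload in the OpenAI chat-completions format.
payload = {
    "model": "qwen2-7b-instruct",  # placeholder model name
    "messages": [
        {"role": "user", "content": "Hello!"},
    ],
    "stream": False,
}

# Serialize the payload; it could then be sent with any HTTP client, e.g.
#   curl $ENDPOINT -H "Content-Type: application/json" -d "$body"
body = json.dumps(payload)
print(body)
```

Any client library that speaks the OpenAI API (for example, the official `openai` Python package pointed at this base URL) should work unchanged.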

Docker Image

# Docker image
# CUDA: 12.4.1, Driver: 550.54.15 and compatible versions
docker pull ghcr.io/zhihu/zhilight/zhilight:0.4.8-cu124
# CUDA: 12.5.1, Driver: 555.42.02 and compatible versions
docker pull ghcr.io/zhihu/zhilight/zhilight:0.4.8-cu125
# The Dockerfile below is provided for reference; you can build your own version-specific image from an official cuDNN base image
# Dockerfile: https://github.com/zhihu/ZhiLight/blob/main/docker/Dockerfile