llmaz (pronounced /lima:z/) aims to provide a Production-Ready inference platform for large language models on Kubernetes. It closely integrates with state-of-the-art inference backends to bring leading-edge research to the cloud.
🌱 llmaz is alpha now, so the API may change before graduating to Beta.
- Ease of Use: People can quickly deploy an LLM service with minimal configuration; see the sketch after this list.
- Broad Backend Support: llmaz supports a wide range of advanced inference backends for different scenarios, like vLLM, SGLang, llama.cpp. Find the full list of supported backends here.
- Scaling Efficiency (WIP): llmaz works smoothly with autoscaling components like Cluster-Autoscaler or Karpenter to support elastic scenarios.
- Accelerator Fungibility (WIP): llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
- SOTA Inference: llmaz supports the latest cutting-edge research, such as Speculative Decoding and Splitwise (WIP), on Kubernetes.
- Various Model Providers: llmaz supports a wide range of model providers, such as HuggingFace, ModelScope, and object stores (Aliyun OSS, with more on the way). llmaz handles model loading automatically, requiring no effort from users.
- Multi-Host Support: llmaz supports both single-host and multi-host scenarios with LWS from day 1.
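
As an illustration of the minimal-configuration workflow, the sketch below registers a model and serves it with a Playground. The kinds and API groups match the CRDs installed by llmaz (openmodels.llmaz.io, playgrounds.inference.llmaz.io), but the API versions and field names here are recalled from the project samples and may differ between releases, so treat this as a hedged sketch and check the official examples for your version.

```yaml
# Register a model pulled from a model hub (field names are illustrative).
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-0--5b
spec:
  familyName: qwen2
  source:
    modelHub:
      modelID: Qwen/Qwen2-0.5B-Instruct
---
# Serve the registered model; llmaz wires up a default backend runtime.
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b
```

Once llmaz is installed (see below), such a manifest is applied with a plain `kubectl apply -f`.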
- Kubernetes version >= 1.27
- Helm 3
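
Both requirements can be checked with standard kubectl and Helm commands, nothing llmaz-specific:

```bash
kubectl version   # the server version should report v1.27 or newer
helm version      # the client should report Helm 3.x
```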
```bash
helm repo add inftyai https://inftyai.github.io/llmaz
helm repo update
helm install llmaz inftyai/llmaz --namespace llmaz-system --create-namespace --version 0.0.3
```
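
Before creating any models, it can help to confirm the llmaz components in llmaz-system are running; the exact pod names depend on the chart version:

```bash
kubectl get pods -n llmaz-system
```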
```bash
helm uninstall llmaz --namespace llmaz-system
kubectl delete ns llmaz-system
```
If you want to delete the CRDs as well, run the following (errors about already-deleted resources can be ignored):
```bash
kubectl delete crd \
  openmodels.llmaz.io \
  backendruntimes.inference.llmaz.io \
  playgrounds.inference.llmaz.io \
  services.inference.llmaz.io
```
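
To confirm the cleanup, you can list any CRDs still carrying the llmaz groups; no output means everything was removed:

```bash
kubectl get crd | grep llmaz
```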