This project provides Kubernetes-based scripts for large language model (LLM) operations, including model downloading, inference service deployment, and performance benchmarking, enabling end-to-end AI workflows on Tencent Kubernetes Engine (TKE).
Prerequisites:
- Kubernetes cluster (v1.28 or later recommended)
- Tencent Cloud CFS storage (or a compatible shared storage solution)
- GPU nodes (this project uses 3 x H20 nodes)
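To sanity-check these prerequisites before deploying anything, the commands below can help; the `nvidia.com/gpu` resource name assumes the NVIDIA device plugin is installed on the GPU nodes:

```bash
# Confirm the cluster version is v1.28 or newer
kubectl version

# Confirm a CFS (or compatible) StorageClass is available
kubectl get storageclass

# List nodes with their allocatable GPU count; the H20 nodes should
# report a non-zero nvidia.com/gpu value
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```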
Use the Model Download Utility to download models to CFS storage for reuse across inference services.
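The shared storage is typically provisioned as a `ReadWriteMany` PVC that both the downloader and the inference services mount. A minimal sketch, in which the StorageClass name `cfs`, the PVC name `llm-models`, and the 500Gi request are assumptions to adapt to your cluster:

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-models            # assumed name, referenced later via PVC_NAME
spec:
  accessModes:
    - ReadWriteMany           # CFS lets the volume be shared across nodes
  storageClassName: cfs       # assumption: your CFS StorageClass name
  resources:
    requests:
      storage: 500Gi          # size it for the target model weights
EOF
```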
- dynamo: NVIDIA's distributed inference framework (open-sourced at GTC 2025), which supports multiple inference engines, including TRT-LLM, vLLM, and SGLang.
Prerequisites:
- 1 x 8-GPU node.
Deploys:
- 1 x VllmWorker (4 GPUs for decode phase)
- 4 x PrefillWorker (1 GPU each for prefill phase)
```bash
bash examples/dynamo/single-node/deploy.sh
```
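Once the script completes, a quick way to confirm the single 4-GPU decode worker and the four 1-GPU prefill workers are running; the exact pod names and namespace depend on what deploy.sh creates:

```bash
# All five worker pods should reach the Running state
kubectl get pods -o wide

# Cross-check the GPU split: one pod requesting 4 GPUs, four pods requesting 1
kubectl describe pods | grep -E '^Name:|nvidia.com/gpu'
```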
[TODO]
- Script: test-openai-api
- Usage:
```bash
API_ENDPOINT=<your-api-endpoint> bash scripts/test-openai-api.sh

# Test a service listening on localhost:8080
API_ENDPOINT=http://localhost:8080 bash scripts/test-openai-api.sh
```
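The script targets an OpenAI-compatible endpoint, so a roughly equivalent manual check with curl looks like the following, assuming the service exposes the standard `/v1/chat/completions` route (the model name is a placeholder for whatever model the service loaded):

```bash
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "<model-id>",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'
```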
- Script: tke-llm-downloader
- Usage:
```bash
LLM_MODEL=<model-id> PVC_NAME=<your-pvc> bash scripts/tke-llm-downloader.sh
```
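For example, a hypothetical invocation that pulls DeepSeek-R1 weights into the PVC sketched earlier (both values are illustrative and must match your setup):

```bash
# Illustrative values: substitute your own model id and PVC name
LLM_MODEL=deepseek-ai/DeepSeek-R1 PVC_NAME=llm-models bash scripts/tke-llm-downloader.sh
```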