**llmaz** (pronounced `/lima:z/`) aims to provide a **Production-Ready** inference platform for large language models on **Kubernetes**. It closely integrates with state-of-the-art inference backends like [vLLM](https://github.com/vllm-project/vllm) to bring cutting-edge research to the cloud.
## Concept

## Feature Overview
- **User Friendly**: People can quickly deploy an LLM service with minimal configurations.
- **High Performance**: llmaz integrates with vLLM by default for high-performance inference. Support for other backends is on the way.
- **Scaling Efficiency**: llmaz works smoothly with autoscaling components like [cluster-autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) or [Karpenter](https://github.com/kubernetes-sigs/karpenter) to support elastic scenarios.
- **Accelerator Fungibility**: llmaz supports serving the same LLM with various accelerators to optimize cost and performance, as sketched below.
- **SOTA Inference**: llmaz supports the latest cutting-edge research, such as [Speculative Decoding](https://arxiv.org/abs/2211.17192) and [Splitwise](https://arxiv.org/abs/2311.18677).
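
For example, accelerator fungibility is declared by listing more than one flavor on a `Model`. A minimal sketch, assuming the `inferenceFlavors` layout from the Quick Start below, with illustrative GPU types:

```yaml
# Sketch only: the flavor names and GPU types are illustrative.
inferenceFlavors:
- name: t4       # preferred accelerator
  requests:
    nvidia.com/gpu: 1
- name: a10      # alternative accelerator to fall back on
  requests:
    nvidia.com/gpu: 1
```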
## Quick Start
### Installation
Read the [Installation](./docs/installation.md) guide for instructions.
### Deploy
Once `Model`s (e.g. facebook/opt-125m) are published, you can quickly deploy a `Playground` to serve the model.
#### Model
```yaml
apiVersion: llmaz.io/v1alpha1
kind: Model
metadata:
  name: opt-125m
spec:
  dataSource:
    modelID: facebook/opt-125m
  inferenceFlavors:
  - name: t4 # GPU type
    requests:
      nvidia.com/gpu: 1
```
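
Assuming the manifest above is saved as `model.yaml` (an illustrative file name), publishing it is a plain `kubectl apply`:

```cmd
kubectl apply -f model.yaml

# List the published models (assuming the resource plural is "models").
kubectl get models
```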
#### Inference Playground
```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: opt-125m
spec:
  modelClaim:
    modelName: opt-125m
```
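
Likewise, applying the `Playground` manifest (saved here as `playground.yaml`, again an illustrative name) creates the serving workload behind the pod used in the test steps below:

```cmd
kubectl apply -f playground.yaml

# Wait until the inference pod (opt-125m-0 in the steps below) is Running.
kubectl get pods -w
```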
### Test
#### Expose the service
```cmd
kubectl port-forward pod/opt-125m-0 8080:8080
```
#### See registered models
```cmd
curl http://localhost:8080/v1/models
```
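
If the backend is up, the endpoint returns an OpenAI-style model list, roughly like the following (illustrative output, trimmed):

```json
{
  "object": "list",
  "data": [
    {
      "id": "facebook/opt-125m",
      "object": "model"
    }
  ]
}
```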
#### Request a query
```cmd
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 10,
    "temperature": 0
  }'
```
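
The reply follows the OpenAI-compatible completion format, roughly like this (illustrative; the generated text and metadata will differ):

```json
{
  "object": "text_completion",
  "model": "facebook/opt-125m",
  "choices": [
    {
      "index": 0,
      "text": " city that has a lot to offer",
      "finish_reason": "length"
    }
  ]
}
```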
Refer to **[examples](/docs/examples/README.md)** to learn more.
## Roadmap
- Gateway support for traffic routing
- Serverless support for cloud-agnostic users
- CLI tool support
- Model training and fine-tuning in the long term
## Contributions