llmaz

Easy, advanced inference platform for large language models on Kubernetes


llmaz (pronounced /lima:z/) aims to provide a production-ready inference platform for large language models on Kubernetes. It closely integrates with state-of-the-art inference backends to bring leading-edge research to the cloud.

🌱 llmaz is currently in alpha, so the API may change before it graduates to beta.

Concept

(Concept architecture diagram)

Feature Overview

  • Ease of Use: People can quickly deploy an LLM service with minimal configuration.
  • Broad Backend Support: llmaz supports a wide range of advanced inference backends for different scenarios, such as vLLM, SGLang, and llama.cpp. Find the full list of supported backends here.
  • Scaling Efficiency (WIP): llmaz works smoothly with autoscaling components like Cluster Autoscaler or Karpenter to support elastic scenarios.
  • Accelerator Fungibility (WIP): llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
  • SOTA Inference: llmaz supports the latest cutting-edge research, such as Speculative Decoding and Splitwise (WIP), on Kubernetes.
  • Various Model Providers: llmaz supports a wide range of model providers, such as HuggingFace, ModelScope, and object stores (Aliyun OSS, with more on the way); see the sketch after this list. llmaz handles model loading automatically, requiring no effort from users.
  • Multi-Host Support: llmaz supports both single-host and multi-host scenarios with LWS from day one.
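
As a rough illustration of the model-provider abstraction, the sketch below shows an OpenModel whose weights come from an object store instead of a model hub. The uri field and the oss:// address here are assumptions for illustration only; consult the API reference for the exact schema.

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: opt-125m-oss
spec:
  familyName: opt
  source:
    # Hypothetical object-store source; the field name and URI scheme are assumptions.
    uri: oss://my-bucket/models/facebook/opt-125m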

Quick Start

Installation

Read the Installation guide for instructions.

Deploy

Here's the simplest example of deploying facebook/opt-125m: all you need to do is apply a Model and a Playground.

Please refer to examples to learn more.

Note: if your model requires a Hugging Face token to download the weights, create the secret first with kubectl create secret generic modelhub-secret --from-literal=HF_TOKEN=<your token>.

Model

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: opt-125m
spec:
  familyName: opt
  source:
    modelHub:
      modelID: facebook/opt-125m
  inferenceFlavors:
  - name: t4 # GPU type
    requests:
      nvidia.com/gpu: 1

Inference Playground

apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: opt-125m
spec:
  replicas: 1
  modelClaim:
    modelName: opt-125m
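
Assuming the two manifests above are saved as model.yaml and playground.yaml (the filenames are arbitrary), apply them with:

kubectl apply -f model.yaml -f playground.yaml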

Test

Expose the service

kubectl port-forward pod/opt-125m-0 8080:8080
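
If the Playground pod is still starting up, you can wait for it to become ready before forwarding; the pod name below matches the example above:

kubectl wait --for=condition=Ready pod/opt-125m-0 --timeout=300s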

Get registered models

curl http://localhost:8080/v1/models

Send a completion request

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 10,
    "temperature": 0
}'

Roadmap

  • Gateway support for traffic routing
  • Metrics support
  • Serverless support for cloud-agnostic users
  • CLI tool support
  • Model training and fine-tuning in the long term

Project Structure

llmaz # root
├── llmaz # where the model loader logic lives
├── pkg # where the main Kubernetes controller logic lives

Contributions

🚀 All kinds of contributions are welcome! Please follow the Contributing guide. Thanks to all these contributors.