
feat: For issue #706 Ray serve with Llama.cpp for CPU inference on Graviton #739

Open

wants to merge 1 commit into main
Conversation

@ddynwzh1992 ddynwzh1992 commented Feb 4, 2025

What does this PR do?


Adds an ML blueprint that supports Ray Serve with the llama.cpp framework for model inference on AWS Graviton.

It includes the following files (a sketch of the Serve deployment follows the list):

ray-service-llamacpp.yaml -- creates the Ray service
llamacpp-serve.py -- Ray Serve Python deployment class bound to llama-cpp-python
perf_benchmark.go -- benchmark script using goroutines
prompts.txt -- example prompts
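
For context, a minimal sketch of the deployment pattern in llamacpp-serve.py; the model path, request schema, and generation settings here are illustrative assumptions, not the blueprint's actual values.

# Hypothetical sketch of the llamacpp-serve.py pattern; model path,
# request schema, and generation settings are assumptions.
import multiprocessing

from llama_cpp import Llama
from ray import serve
from starlette.requests import Request


@serve.deployment
class LLamaCPPDeployment:
    def __init__(self, n_threads: int):
        # Load a GGUF model and pin llama.cpp to the given CPU thread count.
        self.llm = Llama(model_path="/models/model.gguf", n_threads=n_threads)

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        result = self.llm(body["prompt"], max_tokens=128)
        return {"text": result["choices"][0]["text"]}


# Size the thread count from the host CPUs, as the blueprint does.
host_cpu_count = multiprocessing.cpu_count()
model = LLamaCPPDeployment.bind(host_cpu_count)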

Motivation

Contribute to GenAI on EKS

More

  • Yes, I have tested the PR using my local account setup (test evidence is provided under Additional Notes)
  • Mandatory for new blueprints: Yes, I have added an example to support my blueprint PR
  • Mandatory for new blueprints: Yes, I have updated the website/docs or website/blog section for this feature
  • Yes, I ran pre-commit run -a with this PR

For Moderators

  • E2E Test successfully complete before merge?

Additional Notes

…raviton

ray-service-llamacpp.yaml -- Ray service YAML file
llamacpp-serve.py -- Ray Serve Python deployment class bound to llama-cpp-python
perf_benchmark.go -- benchmark script using goroutines (a sketch follows the list)
prompts.txt -- example prompts
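
For reference, a goroutine-based load generator along the lines of perf_benchmark.go could look like the sketch below; the service URL, payload shape, and prompts are assumptions.

// Hypothetical sketch of a goroutine-based benchmark; the endpoint,
// payload shape, and prompts are assumptions.
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"sync"
)

func main() {
	prompts := []string{"Hello", "What is AWS Graviton?"}
	var wg sync.WaitGroup
	for _, p := range prompts {
		wg.Add(1)
		// One goroutine per prompt issues a request concurrently.
		go func(prompt string) {
			defer wg.Done()
			body := bytes.NewBufferString(fmt.Sprintf(`{"prompt": %q}`, prompt))
			resp, err := http.Post("http://localhost:8000/", "application/json", body)
			if err != nil {
				fmt.Println("request failed:", err)
				return
			}
			defer resp.Body.Close()
			out, _ := io.ReadAll(resp.Body)
			fmt.Println(string(out))
		}(p)
	}
	wg.Wait()
}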
@ddynwzh1992 ddynwzh1992 changed the title For issue #706 Ray serve with Llama.cpp for CPU inference on Graviton [feat]For issue #706 Ray serve with Llama.cpp for CPU inference on Graviton Feb 4, 2025
@ddynwzh1992 ddynwzh1992 changed the title [feat]For issue #706 Ray serve with Llama.cpp for CPU inference on Graviton feat: For issue #706 Ray serve with Llama.cpp for CPU inference on Graviton Feb 4, 2025
Contributor

@omrishiv omrishiv left a comment


First off, thank you so much for adding this. It's a great example of using a different tool on a new instance type, and I think it's going to be a great addition!

I left a few comments about formatting and cleanup that will make reviewing this PR a lot easier. I'd also like to avoid things like pulling from other repos or building Docker images if we can help it. Once those are addressed, we can do another round of review.

  num_cpus: 29
runtime_env:
  working_dir: "https://github.com/ddynwzh1992/ray-llm/archive/refs/heads/main.zip"
  pip: ["llama_cpp_python", "transformers==4.46.0"]

please freeze this dependency
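
For illustration, the pinned form could look like this; the llama_cpp_python version shown is only an example, the author should pin whichever version was actually tested:

runtime_env:
  pip: ["llama_cpp_python==0.2.90", "transformers==4.46.0"]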

ray_actor_options:
  num_cpus: 29
runtime_env:
  working_dir: "https://github.com/ddynwzh1992/ray-llm/archive/refs/heads/main.zip"

instead of doing this, please create a configmap of the llamacpp-serve.py file and add it to the head node pod for deployment. Please take a look at this PR for an example: https://github.com/awslabs/data-on-eks/pull/607/files
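
A sketch of that pattern, with hypothetical resource names (the linked PR is the canonical example):

apiVersion: v1
kind: ConfigMap
metadata:
  name: llamacpp-serve-script  # hypothetical name
data:
  llamacpp-serve.py: |
    # paste the contents of llamacpp-serve.py here

The head pod spec would then mount it in place of the remote working_dir:

volumes:
  - name: serve-script
    configMap:
      name: llamacpp-serve-script
containers:
  - name: ray-head
    volumeMounts:
      - name: serve-script
        mountPath: /home/ray/llamacpp-serve.py  # hypothetical path
        subPath: llamacpp-serve.py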

rayClusterConfig:
  rayVersion: '2.33.0'
  enableInTreeAutoscaling: true
  #rayVersion: 3.0.0.dev0

Please remove any commented out code
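
With the commented line removed, the block would simply read:

rayClusterConfig:
  rayVersion: '2.33.0'
  enableInTreeAutoscaling: true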

"io/ioutil"
"net/http"
"strings"
"os" // Add this import

Please format this file
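
For reference, gofmt would group and sort the import block like this, dropping the inline comment; note also that io/ioutil has been deprecated since Go 1.16 in favor of io and os:

import (
	"io/ioutil"
	"net/http"
	"os"
	"strings"
)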

# Get host CPU count
host_cpu_count = multiprocessing.cpu_count()

model = LLamaCPPDeployment.bind(host_cpu_count)

Please add a trailing newline at the end of the file.

@@ -0,0 +1,102 @@


please remove leading whitespace
