
feat: For issue #706 Ray serve with Llama.cpp for CPU inference on Graviton #739

Open

wants to merge 1 commit into main
Conversation

@ddynwzh1992 ddynwzh1992 commented Feb 4, 2025

What does this PR do?


Adds an ML blueprint that supports Ray Serve with the llama.cpp framework for model inference on AWS Graviton.

It includes the following files (a sketch of the Serve deployment follows the list):

ray-service-llamacpp.yaml -- creates the Ray service
llamacpp-serve.py -- Ray Serve Python deployment class bound to llama-cpp-python
perf_benchmark.go -- benchmark script using goroutines
prompts.txt -- example prompts
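
For context, a minimal sketch of the deployment pattern in llamacpp-serve.py; the model path, request schema, and generation settings here are illustrative assumptions, not the blueprint's actual values.

# Hypothetical sketch of the llamacpp-serve.py pattern; model path,
# request schema, and generation settings are assumptions.
import multiprocessing

from llama_cpp import Llama
from ray import serve
from starlette.requests import Request


@serve.deployment
class LLamaCPPDeployment:
    def __init__(self, n_threads: int):
        # Load a GGUF model and pin llama.cpp to the given CPU thread count.
        self.llm = Llama(model_path="/models/model.gguf", n_threads=n_threads)

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        result = self.llm(body["prompt"], max_tokens=128)
        return {"text": result["choices"][0]["text"]}


# Size the thread count from the host CPUs, as the blueprint does.
host_cpu_count = multiprocessing.cpu_count()
model = LLamaCPPDeployment.bind(host_cpu_count)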

Motivation

Contribute to GenAI on EKS

More

  • Yes, I have tested the PR using my local account setup (test evidence is provided under Additional Notes)
  • Mandatory for new blueprints: Yes, I have added an example to support my blueprint PR
  • Mandatory for new blueprints: Yes, I have updated the website/docs or website/blog section for this feature
  • Yes, I ran pre-commit run -a with this PR

For Moderators

  • E2E Test successfully complete before merge?

Additional Notes

…raviton

ray-service-llamacpp.yaml -- Ray service YAML file
llamacpp-serve.py -- Ray Serve Python deployment class bound to llama-cpp-python
perf_benchmark.go -- benchmark script using goroutines (a sketch follows the list)
prompts.txt -- example prompts
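
For reference, a goroutine-based load generator along the lines of perf_benchmark.go could look like the sketch below; the service URL, payload shape, and prompts are assumptions.

// Hypothetical sketch of a goroutine-based benchmark; the endpoint,
// payload shape, and prompts are assumptions.
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"sync"
)

func main() {
	prompts := []string{"Hello", "What is AWS Graviton?"}
	var wg sync.WaitGroup
	for _, p := range prompts {
		wg.Add(1)
		// One goroutine per prompt issues a request concurrently.
		go func(prompt string) {
			defer wg.Done()
			body := bytes.NewBufferString(fmt.Sprintf(`{"prompt": %q}`, prompt))
			resp, err := http.Post("http://localhost:8000/", "application/json", body)
			if err != nil {
				fmt.Println("request failed:", err)
				return
			}
			defer resp.Body.Close()
			out, _ := io.ReadAll(resp.Body)
			fmt.Println(string(out))
		}(p)
	}
	wg.Wait()
}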
@ddynwzh1992 ddynwzh1992 changed the title For issue #706 Ray serve with Llama.cpp for CPU inference on Graviton [feat]For issue #706 Ray serve with Llama.cpp for CPU inference on Graviton Feb 4, 2025
@ddynwzh1992 ddynwzh1992 changed the title [feat]For issue #706 Ray serve with Llama.cpp for CPU inference on Graviton feat: For issue #706 Ray serve with Llama.cpp for CPU inference on Graviton Feb 4, 2025
Contributor

@omrishiv omrishiv left a comment


First off, thank you so much for adding this. It's a great example of using a different tool on a new instance type, and I think it's going to be a great addition!

I left a few comments about formatting and cleanup that will make reviewing this PR a lot easier. I'd also like to avoid things like pulling from other repos or building Docker images if we can help it. Once those are addressed, we can do another round of review.

  num_cpus: 29
runtime_env:
  working_dir: "https://github.com/ddynwzh1992/ray-llm/archive/refs/heads/main.zip"
  pip: ["llama_cpp_python", "transformers==4.46.0"]

please freeze this dependency
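
For illustration, the pinned form could look like this; the llama_cpp_python version shown is only an example, the author should pin whichever version was actually tested:

runtime_env:
  pip: ["llama_cpp_python==0.2.90", "transformers==4.46.0"]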

ray_actor_options:
  num_cpus: 29
runtime_env:
  working_dir: "https://github.com/ddynwzh1992/ray-llm/archive/refs/heads/main.zip"

instead of doing this, please create a configmap of the llamacpp-serve.py file and add it to the head node pod for deployment. Please take a look at this PR for an example: https://github.com/awslabs/data-on-eks/pull/607/files
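
A sketch of that pattern, with hypothetical resource names (the linked PR is the canonical example):

apiVersion: v1
kind: ConfigMap
metadata:
  name: llamacpp-serve-script  # hypothetical name
data:
  llamacpp-serve.py: |
    # paste the contents of llamacpp-serve.py here

The head pod spec would then mount it in place of the remote working_dir:

volumes:
  - name: serve-script
    configMap:
      name: llamacpp-serve-script
containers:
  - name: ray-head
    volumeMounts:
      - name: serve-script
        mountPath: /home/ray/llamacpp-serve.py  # hypothetical path
        subPath: llamacpp-serve.py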

rayClusterConfig:
  rayVersion: '2.33.0'
  enableInTreeAutoscaling: true
  #rayVersion: 3.0.0.dev0

Please remove any commented out code
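
With the commented line removed, the block would simply read:

rayClusterConfig:
  rayVersion: '2.33.0'
  enableInTreeAutoscaling: true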

"io/ioutil"
"net/http"
"strings"
"os" // Add this import

Please format this file
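
For reference, gofmt would group and sort the import block like this, dropping the inline comment; note also that io/ioutil has been deprecated since Go 1.16 in favor of io and os:

import (
	"io/ioutil"
	"net/http"
	"os"
	"strings"
)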

# Get host CPU count
host_cpu_count = multiprocessing.cpu_count()

model = LLamaCPPDeployment.bind(host_cpu_count)

Please add a trailing newline at the end of the file.

@@ -0,0 +1,102 @@


please remove leading whitespace
