-
Notifications
You must be signed in to change notification settings - Fork 239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Observability for rayserve-vllm pattern #592
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good. Couple of things,
- Please run pre-commit and fix any CI errors
- Can we include the vllm dashboard example as well. See https://github.com/vllm-project/vllm/tree/e76466dde2bc9525d55165ceaa600d298c7bf773/examples/production_monitoring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for great work on this @shivam-dubey-1 ! 👏🏼 I have some minor comments otherwise PR looks good to me.
website deployment is failing. Please make sure to run npm serve start to verify the documentation at localhost:3000.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add a source link to these files under the Website documentation? This will help users access the latest versions directly from the source if needed. We will not be maintaining these JSON files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, ray.io itself isn't maintaining the dashboards referenced in the current documentation (https://docs.ray.io/en/latest/cluster/kubernetes/k8s-ecosystem/prometheus-grafana.html#step-10-access-grafana).
To ensure we're using the latest dashboards, I've included steps in our documentation on how to import them. This will provide the most up-to-date visualizations for Ray cluster.
Fixed website deployment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move the monitoring
folder and its files under ai-ml/jark-stack/terraform
, and do the same for ray-dashboards
folder.
gen-ai
folder is only for serving scripts and ray templates
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved monitoring from data-on-eks/gen-ai/inference/vllm-rayserve-gpu
to data-on-eks/ai-ml/jark-stack/terraform
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good. Minor changes.
- name: RAY_GRAFANA_IFRAME_HOST | ||
value: http://127.0.0.1:3000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not needed unless you are rendering the graphs on the Ray Dashboard.
- name: RAY_GRAFANA_IFRAME_HOST | |
value: http://127.0.0.1:3000 |
- name: RAY_GRAFANA_IFRAME_HOST | ||
value: http://127.0.0.1:3000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- name: RAY_GRAFANA_IFRAME_HOST | |
value: http://127.0.0.1:3000 |
|
||
These environment variables are crucial for enabling the embedding of Grafana panels within the Ray Dashboard and for proper communication between Ray, Grafana, and Prometheus: | ||
|
||
- **RAY_GRAFANA_IFRAME_HOST** is used by your browser to fetch Grafana panels. It's set to 127.0.0.1:3000, which is typically used when accessing Grafana through port forwarding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- **RAY_GRAFANA_IFRAME_HOST** is used by your browser to fetch Grafana panels. It's set to 127.0.0.1:3000, which is typically used when accessing Grafana through port forwarding. |
See [Configuring and Managing Ray Dashboard](https://docs.ray.io/en/latest/cluster/configure-manage-dashboard.html) for more details about these environment variables. | ||
```text | ||
Note: Since we forward the port of Grafana to 127.0.0.1:3000 in this example, we set RAY_GRAFANA_IFRAME_HOST to http://127.0.0.1:3000. | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See [Configuring and Managing Ray Dashboard](https://docs.ray.io/en/latest/cluster/configure-manage-dashboard.html) for more details about these environment variables. | |
```text | |
Note: Since we forward the port of Grafana to 127.0.0.1:3000 in this example, we set RAY_GRAFANA_IFRAME_HOST to http://127.0.0.1:3000. | |
``` |
|
||
kubectl port-forward deployment/kube-prometheus-stack-grafana -n kube-prometheus-stack 3000:3000 | ||
|
||
# Note: You need to update `RAY_GRAFANA_IFRAME_HOST` if you expose Grafana to a different port. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Note: You need to update `RAY_GRAFANA_IFRAME_HOST` if you expose Grafana to a different port. |
- Create new dashboard by importing JSON file via Dashboards menu. | ||
- Click 'Dashboards' icon in left panel, 'New', 'Import', then 'Upload JSON file'. | ||
- Choose a JSON file. | ||
- `Case 1:` If you are using Ray 2.24.0, you can use the sample config files in [GitHub repository](https://github.com/shivam-dubey-1/data-on-eks/tree/main/ai-ml/jark-stack/terraform/monitoring/ray-dashboards). The file names have a pattern of xxx_grafana_dashboard.json. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix link
What does this PR do?
🛑 Please open an issue first to discuss any significant work and flesh out details/direction - we would hate for your time to be wasted.
Consult the CONTRIBUTING guide for submitting pull-requests.
Motivation
#586
More
website/docs
orwebsite/blog
section for this featurepre-commit run -a
with this PR. Link for installing pre-commit locallyFor Moderators
Additional Notes
Accessing grafana dashbaord