Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Observability for rayserve-vllm pattern #592

Closed
wants to merge 0 commits into from

Conversation

shivam-dubey-1
Copy link
Contributor

@shivam-dubey-1 shivam-dubey-1 commented Jul 17, 2024

What does this PR do?

🛑 Please open an issue first to discuss any significant work and flesh out details/direction - we would hate for your time to be wasted.
Consult the CONTRIBUTING guide for submitting pull-requests.

Motivation

#586

More

  • Yes, I have tested the PR using my local account setup (Provide any test evidence report under Additional Notes)
  • Mandatory for new blueprints. Yes, I have added a example to support my blueprint PR
  • Mandatory for new blueprints. Yes, I have updated the website/docs or website/blog section for this feature
  • Yes, I ran pre-commit run -a with this PR. Link for installing pre-commit locally

For Moderators

  • E2E Test successfully complete before merge?

Additional Notes

Accessing grafana dashbaord

@shivam-dubey-1 shivam-dubey-1 changed the title feat: #586 Observability for rayserve-vllm pattern using Prom and Grafana dashboards feat: observability for rayserve-vllm pattern using Prom and Grafana dashboards Jul 17, 2024
@shivam-dubey-1 shivam-dubey-1 changed the title feat: observability for rayserve-vllm pattern using Prom and Grafana dashboards feat: rayserve-vllm pattern observability Jul 17, 2024
@shivam-dubey-1 shivam-dubey-1 changed the title feat: rayserve-vllm pattern observability feat: Observability for rayserve-vllm pattern Jul 17, 2024
Copy link
Collaborator

@askulkarni2 askulkarni2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Couple of things,

  1. Please run pre-commit and fix any CI errors
  2. Can we include the vllm dashboard example as well. See https://github.com/vllm-project/vllm/tree/e76466dde2bc9525d55165ceaa600d298c7bf773/examples/production_monitoring

Copy link
Collaborator

@vara-bonthu vara-bonthu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for great work on this @shivam-dubey-1 ! 👏🏼 I have some minor comments otherwise PR looks good to me.

website deployment is failing. Please make sure to run npm serve start to verify the documentation at localhost:3000.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a source link to these files under the Website documentation? This will help users access the latest versions directly from the source if needed. We will not be maintaining these JSON files.

Copy link
Contributor Author

@shivam-dubey-1 shivam-dubey-1 Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, ray.io itself isn't maintaining the dashboards referenced in the current documentation (https://docs.ray.io/en/latest/cluster/kubernetes/k8s-ecosystem/prometheus-grafana.html#step-10-access-grafana).

To ensure we're using the latest dashboards, I've included steps in our documentation on how to import them. This will provide the most up-to-date visualizations for Ray cluster.

Fixed website deployment

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move the monitoring folder and its files under ai-ml/jark-stack/terraform, and do the same for ray-dashboards folder.

gen-ai folder is only for serving scripts and ray templates

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved monitoring from data-on-eks/gen-ai/inference/vllm-rayserve-gpu to data-on-eks/ai-ml/jark-stack/terraform

Copy link
Collaborator

@askulkarni2 askulkarni2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Minor changes.

Comment on lines 105 to 106
- name: RAY_GRAFANA_IFRAME_HOST
value: http://127.0.0.1:3000
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not needed unless you are rendering the graphs on the Ray Dashboard.

Suggested change
- name: RAY_GRAFANA_IFRAME_HOST
value: http://127.0.0.1:3000

Comment on lines 421 to 422
- name: RAY_GRAFANA_IFRAME_HOST
value: http://127.0.0.1:3000
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- name: RAY_GRAFANA_IFRAME_HOST
value: http://127.0.0.1:3000


These environment variables are crucial for enabling the embedding of Grafana panels within the Ray Dashboard and for proper communication between Ray, Grafana, and Prometheus:

- **RAY_GRAFANA_IFRAME_HOST** is used by your browser to fetch Grafana panels. It's set to 127.0.0.1:3000, which is typically used when accessing Grafana through port forwarding.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- **RAY_GRAFANA_IFRAME_HOST** is used by your browser to fetch Grafana panels. It's set to 127.0.0.1:3000, which is typically used when accessing Grafana through port forwarding.

Comment on lines 435 to 438
See [Configuring and Managing Ray Dashboard](https://docs.ray.io/en/latest/cluster/configure-manage-dashboard.html) for more details about these environment variables.
```text
Note: Since we forward the port of Grafana to 127.0.0.1:3000 in this example, we set RAY_GRAFANA_IFRAME_HOST to http://127.0.0.1:3000.
```
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
See [Configuring and Managing Ray Dashboard](https://docs.ray.io/en/latest/cluster/configure-manage-dashboard.html) for more details about these environment variables.
```text
Note: Since we forward the port of Grafana to 127.0.0.1:3000 in this example, we set RAY_GRAFANA_IFRAME_HOST to http://127.0.0.1:3000.
```


kubectl port-forward deployment/kube-prometheus-stack-grafana -n kube-prometheus-stack 3000:3000

# Note: You need to update `RAY_GRAFANA_IFRAME_HOST` if you expose Grafana to a different port.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Note: You need to update `RAY_GRAFANA_IFRAME_HOST` if you expose Grafana to a different port.

- Create new dashboard by importing JSON file via Dashboards menu.
- Click 'Dashboards' icon in left panel, 'New', 'Import', then 'Upload JSON file'.
- Choose a JSON file.
- `Case 1:` If you are using Ray 2.24.0, you can use the sample config files in [GitHub repository](https://github.com/shivam-dubey-1/data-on-eks/tree/main/ai-ml/jark-stack/terraform/monitoring/ray-dashboards). The file names have a pattern of xxx_grafana_dashboard.json.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants