[Serve] [Example] Multi-node example #3398

Open
wants to merge 8 commits into
base: master

Collaborator

@MaoZiming MaoZiming commented Apr 1, 2024

Added a YAML example for multi-node serving of Llama-2-70b-hf on 2 nodes, each with A100:2.
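
For context, a multi-node service YAML of this kind might look roughly like the sketch below. This is not the file from the PR: the `resources` values are the ones visible in the review snippet, while the `service` section, the `/health` probe path, and the `run` command are placeholder assumptions added for illustration.

```yaml
# Sketch only, not the PR's actual multi_node.yaml.
# Assumption: all nodes of a replica are gang-scheduled, so with spot
# instances the loss of one node affects the whole replica.
service:
  readiness_probe: /health  # hypothetical path; depends on the server used
  replicas: 1

resources:
  cloud: gcp
  ports: 8000
  accelerators: A100:2      # per node
  use_spot: true

num_nodes: 2                # two nodes, each with A100:2

run: |
  # SkyPilot sets these variables on every node of a multi-node task.
  echo "node ${SKYPILOT_NODE_RANK} of ${SKYPILOT_NUM_NODES}"
  # Placeholder: start the model server here, typically on the rank-0 node,
  # with the other nodes joining as workers.
```

The point of the sketch is only that `num_nodes` sits at the top level while `accelerators` applies per node.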

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_compatibility_tests.sh

@MaoZiming MaoZiming requested a review from cblmemo April 1, 2024 07:06
Collaborator

@cblmemo cblmemo left a comment

Thanks for adding the multi-node support, @MaoZiming! Left some comments. Also, it would be great to add a smoke test showing that multi-node replicas are gang-scheduled (maybe manually terminate one of the nodes and check the serve controller's behaviour).

examples/serve/multi_node/multi_node.yaml (outdated)
cloud: gcp
ports: 8000
accelerators: A100:2
use_spot: true
Collaborator

If we are using spot, maybe we should add a comment saying that multi-node replicas are gang-scheduled? Actually, I'm not sure we should use spot here. cc @Michaelvll for a look.

Collaborator Author

Both spot and non-spot should be gang-scheduled.

Collaborator

Yeah, I just wanted to point out that spot is more error-prone.

Collaborator Author

Sounds good, added a comment on the first line.

2 participants