Clean up dist_overview.rst #2913
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/2913
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 618603c with merge base 22ae7b9.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
The changes look good to me! I have a few suggestions inline.
tutorial demonstrates how to combine DDP with RPC to train a model using
distributed data parallelism combined with distributed model parallelism.

PyTorch Distributed Developers
------------------------------
We might want to clean up the https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md doc as a follow-up.
Why is RPC no longer supported?
RPC has not been actively developed for a long time, and isn't being used by any of the recently developed distributed components. I am cleaning up the 'top level' overview to highlight the most active/relevant components.
Force-pushed from 3681e81 to 7ce679b (Compare)
tutorial demonstrates how to combine DDP with RPC to train a model using
distributed data parallelism combined with distributed model parallelism.
Also, I'd suggest adding more links at the end to the existing tutorials and other sources that would help users get started. These could go under the See Also section.
Yeah, that's a good point. I'm actually not all that familiar with the tutorial content we already have, but I suspect a lot of it is out of date. I may do a search for any important ones to add now, but I think we can also build that up more over time as we add more tutorials or update old ones.
Force-pushed from 1f0c64d to 064c024 (Compare)
I think adding a heading for each section makes things look much nicer! I have a few more suggestions inline after looking at the previewed doc page.
beginner_source/dist_overview.rst (Outdated)

Data Parallel Training
----------------------
``TorchElastic`` provides the widely used `torchrun <https://pytorch.org/docs/stable/elastic/run.html>` launcher script, which spawns processes on the local and remote machines for running distributed PyTorch programs.
Do we still want to brand it as TorchElastic, or should we just say TorchRun?
Cc @kurman
This looks great! Thanks for cleaning up and rewriting our overview documentation!
One small nit, and it looks like the CI caught a "scalining" typo.
Many overdue updates
* updating the overview to include TP/PP and DTensor/DeviceMesh
* removing RPC, DataParallel and Elastic as they are no longer supported
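For readers who want a concrete picture of the DTensor/DeviceMesh APIs mentioned above, here is a rough, self-contained sketch. The file name, the two-process CPU/gloo setup, and the import path are assumptions rather than content from this PR (the DTensor module has moved between releases).

```python
# dtensor_sketch.py -- a rough illustration; launch with e.g.:
#   torchrun --nproc_per_node=2 dtensor_sketch.py
import torch
from torch.distributed.device_mesh import init_device_mesh
# NOTE: the DTensor module has moved between releases; newer versions expose
# torch.distributed.tensor, older ones torch.distributed._tensor.
from torch.distributed._tensor import Shard, distribute_tensor


def main():
    # Build a 1-D mesh over the two processes spawned by torchrun (CPU/gloo here).
    mesh = init_device_mesh("cpu", (2,))

    # Shard an 8x4 tensor along dim 0: each rank keeps a 4x4 local shard.
    weight = torch.randn(8, 4)
    dweight = distribute_tensor(weight, mesh, placements=[Shard(0)])
    print(f"global shape {tuple(dweight.shape)}, local shape {tuple(dweight.to_local().shape)}")


if __name__ == "__main__":
    main()
```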
@wconstab Why is torch.distributed.elastic no longer supported?
I just tried to simplify the 'overview' doc to point to torchrun. torchrun is the main binary component of torchelastic, and we do support it.
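As a concrete illustration of what torchrun launches, here is a minimal sketch of a DDP script started with `torchrun --nproc_per_node=2`. The file name, process count, and CPU/gloo backend are illustrative assumptions, not text from the tutorial under review.

```python
# train_ddp.py -- minimal DDP script meant to be launched with torchrun, e.g.:
#   torchrun --nproc_per_node=2 train_ddp.py
# torchrun spawns one process per --nproc_per_node and sets RANK, LOCAL_RANK,
# and WORLD_SIZE in each process's environment.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # gloo backend so the sketch also runs on CPU-only machines
    dist.init_process_group(backend="gloo")

    model = torch.nn.Linear(10, 10)
    ddp_model = DDP(model)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    loss = ddp_model(torch.randn(20, 10)).sum()
    loss.backward()  # gradients are averaged across ranks during backward
    optimizer.step()

    print(f"rank {dist.get_rank()} finished one step")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```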
@svekars can you merge this at your convenience if it looks OK to you?
Many overdue updates
Fixes #ISSUE_NUMBER