
Clean up dist_overview.rst #2913

Merged: 1 commit merged into pytorch:main from whc/cleanup on Jun 13, 2024
Conversation

@wconstab (Contributor) commented Jun 7, 2024

Many overdue updates:

  • updating the overview to include TP/PP and DTensor/DeviceMesh (see the sketch below)
  • removing RPC and DataParallel as they are no longer supported
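
For readers less familiar with the components now highlighted in the overview, here is a minimal, hypothetical sketch of DeviceMesh and DTensor usage. It is not taken from this PR, and the DTensor import path has moved between PyTorch releases, so treat the imports as approximate:

```python
# dtensor_sketch.py -- illustrative only; launch under torchrun so RANK/WORLD_SIZE
# are set, e.g.: torchrun --nproc_per_node=4 dtensor_sketch.py
import torch
from torch.distributed.device_mesh import init_device_mesh
# In some releases DTensor is exposed under torch.distributed._tensor instead.
from torch.distributed._tensor import distribute_tensor, Shard

# Build a 1-D mesh over 4 ranks ("cuda" assumes one GPU per rank; use "cpu" otherwise).
mesh = init_device_mesh("cuda", (4,))

# Shard an ordinary tensor along dim 0 across the mesh, producing a DTensor.
full = torch.randn(8, 16)
dt = distribute_tensor(full, mesh, placements=[Shard(0)])
print(dt.placements, dt.to_local().shape)  # each rank holds a (2, 16) local shard
```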

Fixes #ISSUE_NUMBER

Description

Checklist

  • The issue that is being fixed is referenced in the description (see above "Fixes #ISSUE_NUMBER")
  • Only one issue is addressed in this pull request
  • Labels from the issue that this PR is fixing are added to this pull request
  • No unnecessary issues are included in this pull request

pytorch-bot (bot) commented Jun 7, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/2913

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 618603c with merge base 22ae7b9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@wanchaol (Contributor) left a comment

The changes look good to me! I have a few suggestions inline.

tutorial demonstrates how to combine DDP with RPC to train a model using
distributed data parallelism combined with distributed model parallelism.


PyTorch Distributed Developers
------------------------------

Contributor

Why is RPC no longer supported?

Contributor Author

RPC has not been actively developed for a long time, and it isn't used by any of the recently developed distributed components. I am cleaning up the 'top-level' overview to highlight the most active and relevant components.

@wconstab force-pushed the whc/cleanup branch 2 times, most recently from 3681e81 to 7ce679b on June 10, 2024 18:23
tutorial demonstrates how to combine DDP with RPC to train a model using
distributed data parallelism combined with distributed model parallelism.


Contributor

Also, I suggest adding more links at the end to the existing tutorials and other resources that would help users get started. These could go under the See Also section.

Contributor Author

Yeah, that's a good point. I'm actually not all that familiar with the tutorials content we already have, but I suspect a lot of it is out of date. I may do a search for any important ones to add now, but I think we can also build that up more over time as we add more tutorials or update old ones.

@wconstab force-pushed the whc/cleanup branch 3 times, most recently from 1f0c64d to 064c024 on June 11, 2024 17:29
@wconstab requested review from wanchaol and svekars on June 11, 2024 17:31
@wanchaol (Contributor) left a comment

I think adding a heading for each section makes things look much nicer! I have a few more suggestions inline after looking at the previewed doc page.


Data Parallel Training
----------------------
``TorchElastic`` provides the widely used `torchrun <https://pytorch.org/docs/stable/elastic/run.html>` launcher script, which spawns processes on the local and remote machines for running distributed PyTorch programs.
Contributor

Do we still want to brand it as TorchElastic? Or should we just say TorchRun?

Contributor Author

Cc @kurman

@wanchaol (Contributor) left a comment

This looks great! Thanks for cleaning up and rewriting our overview documentation!

One small nit, and it looks like the CI caught a typo ("scalining").

Many overdue updates
* updating the overview to include TP/PP and DTensor/Devicemesh
* removing RPC, DataParallel and Elastic as they are no longer supported
@GuWei007

@wconstab Why is torch.distributed.elastic no longer supported?

@wconstab (Contributor Author)

I just tried to simplify the 'overview' doc to point to torchrun. torchrun is the main binary component of torchelastic, and we do support it.
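
To make the relationship concrete, here is a minimal sketch (not from this PR) of what "pointing users at torchrun" means in practice; the script name and flag values are made-up placeholders:

```python
# minimal_dist.py -- hypothetical script; launch it with the torchelastic launcher, e.g.:
#   torchrun --nnodes=1 --nproc_per_node=4 minimal_dist.py
# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each spawned process, so the
# script only needs to initialize the default process group from the environment.
import os
import torch.distributed as dist

def main():
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU machines
    rank, world = dist.get_rank(), dist.get_world_size()
    print(f"rank {rank}/{world} (local rank {os.environ.get('LOCAL_RANK')}) is up")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```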

@wconstab (Contributor Author)

@svekars can you merge this at your convenience if it looks OK to you?

@svekars merged commit 2748f2c into pytorch:main on Jun 13, 2024
21 checks passed