
Clean up dist_overview.rst #2913

Merged: 1 commit merged into pytorch:main from whc/cleanup on Jun 13, 2024
Conversation

@wconstab (Contributor) commented Jun 7, 2024

Many overdue updates:

  • updating the overview to include TP/PP and DTensor/DeviceMesh (see the sketch below)
  • removing RPC and DataParallel as they are no longer supported
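
For readers less familiar with the components now highlighted in the overview, here is a minimal, hypothetical sketch of DeviceMesh and DTensor usage. It is not taken from this PR, and the DTensor import path has moved between PyTorch releases, so treat the imports as approximate:

```python
# dtensor_sketch.py -- illustrative only; launch under torchrun so RANK/WORLD_SIZE
# are set, e.g.: torchrun --nproc_per_node=4 dtensor_sketch.py
import torch
from torch.distributed.device_mesh import init_device_mesh
# In some releases DTensor is exposed under torch.distributed._tensor instead.
from torch.distributed._tensor import distribute_tensor, Shard

# Build a 1-D mesh over 4 ranks ("cuda" assumes one GPU per rank; use "cpu" otherwise).
mesh = init_device_mesh("cuda", (4,))

# Shard an ordinary tensor along dim 0 across the mesh, producing a DTensor.
full = torch.randn(8, 16)
dt = distribute_tensor(full, mesh, placements=[Shard(0)])
print(dt.placements, dt.to_local().shape)  # each rank holds a (2, 16) local shard
```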

Fixes #ISSUE_NUMBER

Description

Checklist

  • The issue that is being fixed is referenced in the description (see above "Fixes #ISSUE_NUMBER")
  • Only one issue is addressed in this pull request
  • Labels from the issue that this PR is fixing are added to this pull request
  • No unnecessary issues are included in this pull request

pytorch-bot (bot) commented Jun 7, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/2913

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 618603c with merge base 22ae7b9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@wanchaol (Contributor) left a comment

The changes look good to me! I have a few suggestions inline.

tutorial demonstrates how to combine DDP with RPC to train a model using
distributed data parallelism combined with distributed model parallelism.


PyTorch Distributed Developers
------------------------------

Contributor

Why is RPC no longer supported?

Contributor Author

RPC has not been actively developed for a long time, and it isn't used by any of the recently developed distributed components. I am cleaning up the 'top-level' overview to highlight the most active and relevant components.

@wconstab force-pushed the whc/cleanup branch 2 times, most recently from 3681e81 to 7ce679b on June 10, 2024 18:23
tutorial demonstrates how to combine DDP with RPC to train a model using
distributed data parallelism combined with distributed model parallelism.


Contributor

Also, I suggest adding more links at the end to the existing tutorials and other resources that would help users get started. These could go under the See Also section.

Contributor Author

Yeah, that's a good point. I'm actually not all that familiar with the tutorials content we already have, but I suspect a lot of it is out of date. I may do a search for any important ones to add now, but I think we can also build that up more over time as we add more tutorials or update old ones.

@wconstab force-pushed the whc/cleanup branch 3 times, most recently from 1f0c64d to 064c024 on June 11, 2024 17:29
@wconstab requested review from wanchaol and svekars on June 11, 2024 17:31
@wanchaol (Contributor) left a comment

I think adding a heading for each section makes things look much nicer! I have a few more suggestions inline after looking at the previewed doc page.


Data Parallel Training
----------------------
``TorchElastic`` provides the widely used `torchrun <https://pytorch.org/docs/stable/elastic/run.html>` launcher script, which spawns processes on the local and remote machines for running distributed PyTorch programs.
Contributor

Do we still want to brand it as TorchElastic? Or should we just say TorchRun?

Contributor Author

Cc @kurman

@wanchaol (Contributor) left a comment

This looks great! Thanks for cleaning up and rewriting our overview documentation!

One small nit, and it looks like the CI caught a typo ("scalining").

Many overdue updates
* updating the overview to include TP/PP and DTensor/Devicemesh
* removing RPC, DataParallel and Elastic as they are no longer supported
@GuWei007

@wconstab Why is torch.distributed.elastic no longer supported?

@wconstab (Contributor Author)

I just tried to simplify the 'overview' doc to point to torchrun. torchrun is the main binary component of torchelastic, and we do support it.
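
To make the relationship concrete, here is a minimal sketch (not from this PR) of what "pointing users at torchrun" means in practice; the script name and flag values are made-up placeholders:

```python
# minimal_dist.py -- hypothetical script; launch it with the torchelastic launcher, e.g.:
#   torchrun --nnodes=1 --nproc_per_node=4 minimal_dist.py
# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each spawned process, so the
# script only needs to initialize the default process group from the environment.
import os
import torch.distributed as dist

def main():
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU machines
    rank, world = dist.get_rank(), dist.get_world_size()
    print(f"rank {rank}/{world} (local rank {os.environ.get('LOCAL_RANK')}) is up")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```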

@wconstab (Contributor Author)

@svekars can you merge this at your convenience if it looks OK to you?

@svekars merged commit 2748f2c into pytorch:main on Jun 13, 2024
21 checks passed