
Combine / rename dist.ddp and dist.spmd into dist.torchrun #798

Open
schmidt-ai opened this issue Dec 8, 2023 · 0 comments

Description

Currently, dist.ddp and dist.spmd are essentially identical (the latter is a lightweight wrapper around the former). They could also be named more explicitly: dist.ddp doesn't actually involve Distributed Data Parallel, it just invokes torchrun.

Motivation/Background

All else equal, simplification and explicit naming are good. For example, users leveraging Fully Sharded Data Parallel instead of DDP may find it confusing that they should be using dist.ddp.

Detailed Proposal

Refactor components/dist.py by combining the methods for ddp and spmd into one method called torchrun. Update docs, tests, examples, and callsites as appropriate.
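To make the proposal concrete, here is a minimal sketch of what the combined component might look like. This is illustrative only, not the actual TorchX API: the spec is modeled as a plain dict rather than specs.AppDef, the `j` (nodes x procs) parameter is borrowed from the existing components, and the deprecated-alias approach is one possible migration path.

```python
import warnings
from typing import Any, Dict


def torchrun(*script_args: str, script: str, j: str = "1x1") -> Dict[str, Any]:
    """Build a simplified app spec that launches `script` via torchrun.

    Stand-in for the real specs.AppDef; fields are illustrative only.
    `j` encodes "<nnodes>x<nproc_per_node>", as in the existing components.
    """
    nnodes, nproc_per_node = j.split("x")
    return {
        "entrypoint": "torchrun",
        "args": [
            f"--nnodes={nnodes}",
            f"--nproc-per-node={nproc_per_node}",
            script,
            *script_args,
        ],
    }


def ddp(*script_args: str, **kwargs: Any) -> Dict[str, Any]:
    """Deprecated alias kept for backward compatibility during migration."""
    warnings.warn(
        "dist.ddp is deprecated; use dist.torchrun", DeprecationWarning, stacklevel=2
    )
    return torchrun(*script_args, **kwargs)


# spmd was already a thin wrapper over ddp; alias it the same way.
spmd = ddp
```

Keeping ddp and spmd as deprecation-warning aliases (rather than deleting them outright) would let existing callsites migrate gradually.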

Alternatives

  1. Leave things as-is.
  2. Remove ddp by rolling it into spmd and keep only the spmd method, so dist.spmd is the only available command and it has a "good enough" name.

Additional context/links

@danielbear

schmidt-ai changed the title from "Combine / rename dist.ddp and dist.spmd to dist.torchrun" to "Combine / rename dist.ddp and dist.spmd into dist.torchrun" on Dec 8, 2023