[BUG] Distributed inference OOMs on machines with different RAM size #1804
Comments
So the problem here is that the pipeline parallel is pretty dumb and assumes each machine has an equal amount of RAM. It divides the model evenly into three sections, and the third section is way too big for your 64GB M2 Ultra. We could do something a bit more dynamic based on the machine size to support heterogeneous machines.
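For illustration only (this is not how the current pipeline split is implemented), a RAM-proportional split could look something like the sketch below. `assign_layers` is a hypothetical helper, not an mlx or mlx-lm API, and the layer count and per-machine RAM figures are assumptions for the example.

```python
# Hypothetical sketch: size each rank's contiguous slice of layers in
# proportion to that machine's RAM, instead of an even three-way split.
def assign_layers(num_layers: int, ram_per_rank_gb: list[float]) -> list[range]:
    total = sum(ram_per_rank_gb)
    ideal = [num_layers * r / total for r in ram_per_rank_gb]  # fractional shares
    counts = [int(x) for x in ideal]
    # Hand the leftover layers to the ranks with the largest remainders.
    leftover = num_layers - sum(counts)
    by_remainder = sorted(range(len(ideal)),
                          key=lambda i: ideal[i] - counts[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    # Turn counts into contiguous layer ranges.
    ranges, start = [], 0
    for c in counts:
        ranges.append(range(start, start + c))
        start += c
    return ranges

# Example: 61 layers split across two 192 GB machines and one 64 GB machine
# (figures are illustrative, not taken from the reporter's cluster).
print(assign_layers(61, [192, 192, 64]))
# [range(0, 26), range(26, 52), range(52, 61)]
```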
Is MLX doing its own sharding? I thought you needed Exo for that.
Yes, MLX can do distributed inference directly using mx.distributed. Right now it's a lower-level API than what you can do with Exo, so it depends on what you want to do.
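For reference, a minimal sketch of that lower-level `mx.distributed` API is below; the script name and launch command are illustrative, and exact launch details depend on your setup.

```python
# Minimal mx.distributed sketch. Launch with something like:
#   mpirun -np 3 --hostfile hosts.txt python example.py
import mlx.core as mx

group = mx.distributed.init()            # join the distributed group
print(f"rank {group.rank()} of {group.size()}")

# Each rank contributes a tensor; all_sum reduces it across all ranks.
x = mx.ones((4,)) * group.rank()
total = mx.distributed.all_sum(x)
mx.eval(total)
print(total)  # same summed result on every rank
```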
Thank you for the reply @awni.
Thanks for the clarification. It did feel that way reading through https://github.com/ml-explore/mlx/tree/main/mlx/distributed.
Supporting this configuration does feel a bit hard to justify. Most clusters use identical machines, or a homogeneous sub-cluster within a larger supercomputer, precisely to avoid issues like this. Happy to discuss & test any PRs on this cluster while I have it.
It works now.
Describe the bug
Running distributed inference of DeepSeek-R1-3bit on three M2 Ultra machines fails.
Desktop:
To Reproduce
hosts.txt:
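(The reporter's actual hostfile isn't shown. For context, a plain OpenMPI hostfile for three machines looks roughly like the following; the hostnames are placeholders.)

```
m2-ultra-1.local slots=1
m2-ultra-2.local slots=1
m2-ultra-3.local slots=1
```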
Actual behavior
Expected behavior
Each machine loads up about 80-90% of its memory with weights and does not OOM. Inference eventually runs and produces tokens.
Additional context
MPI log: