
[BUG] Distributed inference OOMs on machines with different RAM size #1804

Open
silibattlebot opened this issue Jan 28, 2025 · 5 comments
Labels
enhancement New feature or request

Comments

@silibattlebot

silibattlebot commented Jan 28, 2025

Describe the bug
Running distributed inference of DeepSeek-R1-3bit across three M2 Ultra machines with different amounts of RAM fails with a GPU timeout after the smallest machine runs out of memory.

Desktop:

  • OS Version: macOS Sequoia 15.2
  • mpirun (Open MPI) 4.1.1 installed via conda
  • Python 3.12.7 | packaged by Anaconda, Inc
  • MLX version: 0.22.0, built from source at e6a7ab9
  • machine1: 192GB M2 Ultra
  • machine2: 128GB M2 Ultra
  • machine3: 64GB M2 Ultra
  • Connectivity: 1Gb Ethernet switch connected to Ethernet 1 of all devices.

To Reproduce

  1. Follow the instructions in this gist, including setting the GPU limit of each machine to ~80% of its capacity (see the sketch after the hostfile below).
  2. Launch inference with the command below:
mpirun -np 3 --hostfile hosts.txt /path/to/anaconda/python3 /path/to/pipeline_generate.py --prompt "Hello world"

hosts.txt:

machine1 slots=1 
machine2 slots=1
machine3 slots=1
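
For reference, the GPU-limit step can also be sketched per process in Python. This is only an assumption-laden sketch: it presumes mx.metal.device_info() and mx.metal.set_memory_limit() are available in this MLX build, and the gist itself most likely uses a system-level setting (e.g. the iogpu.wired_limit_mb sysctl) instead.

# Hypothetical per-process check, not part of pipeline_generate.py:
# cap MLX's Metal memory use at ~80% of this machine's RAM before loading weights.
import mlx.core as mx

info = mx.metal.device_info()                 # includes total "memory_size" in bytes
limit_bytes = int(0.8 * info["memory_size"])  # ~80% of physical RAM
mx.metal.set_memory_limit(limit_bytes)
print(f"Metal memory limit set to {limit_bytes / 2**30:.1f} GiB")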

Actual behavior

  1. Machines 1 and 2 (192GB and 128GB memory) each load about 105GB of weights without using any swap, leaving ~90GB and ~25GB free respectively.
  2. Machine 3 (with 64 GB memory) hits 58.5 GB RAM utilization and keeps climbing.
  3. Machine 1 shows an error:
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
  4. prterun then exits with an error on Machine 1.
  5. Machines 2 and 3 hold ~100GB in RAM and/or swap for another ~30 seconds before exiting.

Expected behavior
Each machine loads weights into roughly 80-90% of its memory and does not OOM. Inference eventually runs and produces tokens.

Additional context
MPI log:

(base) user@machine1 ~ % mpirun -np 3 --hostfile hosts.txt /opt/homebrew/anaconda3/bin/python3 /Users/user/deepseek/pipeline_generate.py --prompt "Hello world"
Fetching 75 files: 100%|██████████| 75/75 [00:00<00:00, 27562.67it/s]
Fetching 75 files: 100%|██████████| 75/75 [00:00<00:00, 41342.20it/s]
Fetching 75 files: 100%|██████████| 75/75 [00:00<00:00, 34166.70it/s]
[WARNING] Generating with a model that requires 100534 MB which is close to the maximum recommended size of 48000 MB. This can be slow. See the documentation for possible work-arounds: https://github.com/ml-explore/mlx-examples/tree/main/llms#large-models
[WARNING] Generating with a model that requires 116168 MB which is close to the maximum recommended size of 110000 MB. This can be slow. See the documentation for possible work-arounds: https://github.com/ml-explore/mlx-examples/tree/main/llms#large-models
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
[machine1:42301] *** Process received signal ***
[machine1:42301] Signal: Abort trap: 6 (6)
[machine1:42301] Signal code:  (0)
[machine1:42301] [ 0] 0   libsystem_platform.dylib            0x0000000198072e04 _sigtramp + 56
[machine1:42301] [ 1] 0   libsystem_pthread.dylib             0x000000019803bf70 pthread_kill + 288
[machine1:42301] [ 2] 0   libsystem_c.dylib                   0x0000000197f48908 abort + 128
[machine1:42301] [ 3] 0   libc++abi.dylib                     0x0000000197ff244c _ZN10__cxxabiv130__aligned_malloc_with_fallbackEm + 0
[machine1:42301] [ 4] 0   libc++abi.dylib                     0x0000000197fe0a24 _ZL28demangling_terminate_handlerv + 320
[machine1:42301] [ 5] 0   libobjc.A.dylib                     0x0000000197c893f4 _ZL15_objc_terminatev + 172
[machine1:42301] [ 6] 0   libc++abi.dylib                     0x0000000197ff1710 _ZSt11__terminatePFvvE + 16
[machine1:42301] [ 7] 0   libc++abi.dylib                     0x0000000197ff16b4 _ZSt9terminatev + 108
[machine1:42301] [ 8] 0   libdispatch.dylib                   0x0000000197e89688 _dispatch_client_callout4 + 40
[machine1:42301] [ 9] 0   libdispatch.dylib                   0x0000000197ea5c88 _dispatch_mach_msg_invoke + 464
[machine1:42301] [10] 0   libdispatch.dylib                   0x0000000197e90a38 _dispatch_lane_serial_drain + 352
[machine1:42301] [11] 0   libdispatch.dylib                   0x0000000197ea69dc _dispatch_mach_invoke + 456
[machine1:42301] [12] 0   libdispatch.dylib                   0x0000000197e90a38 _dispatch_lane_serial_drain + 352
[machine1:42301] [13] 0   libdispatch.dylib                   0x0000000197e91764 _dispatch_lane_invoke + 432
[machine1:42301] [14] 0   libdispatch.dylib                   0x0000000197e90a38 _dispatch_lane_serial_drain + 352
[machine1:42301] [15] 0   libdispatch.dylib                   0x0000000197e91730 _dispatch_lane_invoke + 380
[machine1:42301] [16] 0   libdispatch.dylib                   0x0000000197e9c9a0 _dispatch_root_queue_drain_deferred_wlh + 288
[machine1:42301] [17] 0   libdispatch.dylib                   0x0000000197e9c1ec _dispatch_workloop_worker_thread + 540
[machine1:42301] [18] 0   libsystem_pthread.dylib             0x00000001980383d8 _pthread_wqthread + 288
[machine1:42301] [19] 0   libsystem_pthread.dylib             0x00000001980370f0 start_wqthread + 8
[machine1:42301] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
/opt/homebrew/anaconda3/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node machine1 exited on signal 6 (Abort trap: 6).
--------------------------------------------------------------------------
@awni
Member

awni commented Jan 28, 2025

So the problem here is the pipeline parallel is pretty dumb and assumes each machine has an equal amount of RAM. It divides the model evenly into three sections, and the third section is way too big for your 64GB M2 Ultra.

We could do something a bit more dynamic based on the machine size to support heterogeneous machines.
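
To illustrate, a RAM-weighted split could look something like the sketch below. This is hypothetical, not what MLX currently does; the 61-layer count and per-machine memory figures are only illustrative.

# Hypothetical sketch of a RAM-weighted pipeline split across ranks.
def split_layers_by_ram(num_layers, ram_gb_per_rank):
    total = sum(ram_gb_per_rank)
    # Proportional layer counts, rounded down.
    counts = [int(num_layers * r / total) for r in ram_gb_per_rank]
    # Hand any leftover layers to the largest machines.
    leftover = num_layers - sum(counts)
    for j in sorted(range(len(counts)), key=lambda k: -ram_gb_per_rank[k])[:leftover]:
        counts[j] += 1
    # Convert counts into contiguous (start, end) layer ranges per rank.
    ranges, start = [], 0
    for c in counts:
        ranges.append((start, start + c))
        start += c
    return ranges

# Example with this cluster: 192GB, 128GB, and 64GB machines.
print(split_layers_by_ram(61, [192, 128, 64]))
# -> [(0, 31), (31, 51), (51, 61)], i.e. roughly 31 / 20 / 10 layers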

@awni awni added the enhancement New feature or request label Jan 28, 2025
@ProjectAtlantis-dev

Is MLX doing its own sharding? I thought you needed exo for that.

@awni
Member

awni commented Jan 28, 2025

Yes, MLX can do distributed inference directly using mx.distributed. Right now it's a lower-level API than what you can do with Exo, so it depends on what you want to do.
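
For a sense of what that lower-level API looks like, here is a minimal sketch using the basic collectives; launch it with the same mpirun/hostfile invocation as in this issue.

# Minimal mx.distributed example: each rank contributes a value and
# all_sum reduces it across the group.
import mlx.core as mx

group = mx.distributed.init()      # picks up the MPI environment set up by mpirun
rank, size = group.rank(), group.size()

x = mx.ones((4,)) * rank           # rank-specific data
total = mx.distributed.all_sum(x)  # summed across all ranks
print(f"rank {rank}/{size}: {total}")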

@silibattlebot
Author

Thank you for the reply @awni.

So the problem here is the pipeline parallel is pretty dumb and assumes each machine has an equal amount of RAM. It divides the model evenly into three sections, and the third section is way too big for your 64GB M2 Ultra.

Thanks for the clarification. It did feel that way reading through https://github.com/ml-explore/mlx/tree/main/mlx/distributed.

We could do something a bit more dynamic based on the machine size to support heterogeneous machines.

Supporting this configuration does feel a bit hard to justify. Most clusters use identical machines, or a homogeneous sub-cluster within a larger supercomputer, precisely to avoid issues like this.

Happy to discuss & test any PRs on this cluster while I have it.

@silibattlebot
Author

silibattlebot commented Jan 28, 2025

Is MLX doing its own sharding? I thought you needed exo for that.

For future reference, this configuration does not presently work on exo either: exo-explore/exo#641

It works now.
