[BUG] Distributed inference OOMs on machines with different RAM size #1804
Comments
So the problem here is that the pipeline parallel is pretty dumb and assumes each machine has an equal amount of RAM. It divides the model evenly into three sections, and the third section is way too big for your 64GB M2 Ultra. We could do something a bit more dynamic based on the machine size to support heterogeneous machines.
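For illustration only (this is not how the current pipeline split is implemented), a RAM-proportional split could look something like the sketch below. `assign_layers` is a hypothetical helper, not an mlx or mlx-lm API, and the layer count and per-machine RAM figures are assumptions for the example.

```python
# Hypothetical sketch: size each rank's contiguous slice of layers in
# proportion to that machine's RAM, instead of an even three-way split.
def assign_layers(num_layers: int, ram_per_rank_gb: list[float]) -> list[range]:
    total = sum(ram_per_rank_gb)
    ideal = [num_layers * r / total for r in ram_per_rank_gb]  # fractional shares
    counts = [int(x) for x in ideal]
    # Hand the leftover layers to the ranks with the largest remainders.
    leftover = num_layers - sum(counts)
    by_remainder = sorted(range(len(ideal)),
                          key=lambda i: ideal[i] - counts[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    # Turn counts into contiguous layer ranges.
    ranges, start = [], 0
    for c in counts:
        ranges.append(range(start, start + c))
        start += c
    return ranges

# Example: 61 layers split across two 192 GB machines and one 64 GB machine
# (figures are illustrative, not taken from the reporter's cluster).
print(assign_layers(61, [192, 192, 64]))
# [range(0, 26), range(26, 52), range(52, 61)]
```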
Is MLX doing its own sharding? I thought you needed Exo for that.
Yes, MLX can do distributed inference directly using mx.distributed. Right now it's a lower-level API than what you can do with Exo, so it depends on what you want to do.
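For reference, a minimal sketch of that lower-level `mx.distributed` API is below; the script name and launch command are illustrative, and exact launch details depend on your setup.

```python
# Minimal mx.distributed sketch. Launch with something like:
#   mpirun -np 3 --hostfile hosts.txt python example.py
import mlx.core as mx

group = mx.distributed.init()            # join the distributed group
print(f"rank {group.rank()} of {group.size()}")

# Each rank contributes a tensor; all_sum reduces it across all ranks.
x = mx.ones((4,)) * group.rank()
total = mx.distributed.all_sum(x)
mx.eval(total)
print(total)  # same summed result on every rank
```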
Thank you for the reply @awni.
Thanks for the clarification. It did feel that way reading through https://github.com/ml-explore/mlx/tree/main/mlx/distributed.
Supporting this configuration does feel a bit hard to justify. Most clusters use identical machines, or a homogeneous sub-cluster within a larger supercomputer, precisely to avoid issues like this. Happy to discuss & test any PRs on this cluster while I have it.
It works now.
Describe the bug
Running distributed inference of DeepSeek-R1-3bit on three M2 Ultra machines fails.
Desktop:
To Reproduce
hosts.txt:
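(The reporter's actual hostfile isn't shown. For context, a plain OpenMPI hostfile for three machines looks roughly like the following; the hostnames are placeholders.)

```
m2-ultra-1.local slots=1
m2-ultra-2.local slots=1
m2-ultra-3.local slots=1
```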
Actual behavior
Expected behavior
Each machine loads up about 80-90% of its memory with weights and does not OOM. Inference eventually runs and produces tokens.
Additional context
MPI log: