[BUG] Distributed inference OOMs on machines with different RAM size #1804

Open
@silibattlebot

Describe the bug
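Running distributed inference of DeepSeek-R1-3bit across three M2 Ultra machines with different amounts of RAM (192GB, 128GB, 64GB) fails: the 64GB machine is asked to load more weights than it can hold, and the run aborts with a Metal GPU timeout on machine 1.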

Desktop:

  • OS Version: macOS Sequoia 15.2
  • mpirun (Open MPI) 4.1.1 installed via conda
  • Python 3.12.7 | packaged by Anaconda, Inc
  • Version: 0.22.0 built from source at e6a7ab9
  • machine1: 192GB M2 Ultra
  • machine2: 128GB M2 Ultra
  • machine3: 64GB M2 Ultra
  • Connectivity: 1Gb Ethernet switch connected to Ethernet 1 of all devices.

To Reproduce

  1. Follow instructions in this gist, including setting the GPU limit of each machine to ~80% of its capacity.
  2. Launch inference as described in the gist:
mpirun -np 3 --hostfile hosts.txt /path/to/anaconda/python3 /path/to/pipeline_generate.py --prompt "Hello world"

hosts.txt:

machine1 slots=1 
machine2 slots=1
machine3 slots=1
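
For reference, the "~80% of capacity" GPU limit from step 1 works out as follows for the three machines. This is only a sketch of the arithmetic: the helper name `wired_limit_mb` is hypothetical, and the `iogpu.wired_limit_mb` sysctl key printed below is an assumption about how the limit is set, since the gist's exact commands are not reproduced in this issue.

```python
def wired_limit_mb(ram_gb: int, fraction: float = 0.8) -> int:
    """Return `fraction` of a machine's RAM, in whole MB (hypothetical helper)."""
    return int(ram_gb * 1024 * fraction)


# RAM sizes taken from the machine list in this issue.
machines = {"machine1": 192, "machine2": 128, "machine3": 64}

for name, ram_gb in machines.items():
    limit = wired_limit_mb(ram_gb)
    # Assumed sysctl key; verify against the gist before running it.
    print(f"{name}: sudo sysctl iogpu.wired_limit_mb={limit}")
```
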

Actual behavior

  1. Machines 1 and 2 (192GB and 128GB of memory) each load about 105GB of weights without using any swap, leaving ~90GB and ~25GB free, respectively.
  2. Machine 3 (64GB of memory) reaches 58.5GB of RAM utilization and keeps loading instead of stopping there.
  3. Machine 1 shows an error:
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
  4. prterun then exits with an error on Machine 1.
  5. Machines 2 and 3 hold ~100GB in RAM and/or swap for another 30 seconds before exiting.

Expected behavior
Each machine loads up about 80-90% of its memory with weights and does not OOM. Inference eventually runs and produces tokens.
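
To illustrate the expected behavior, here is a minimal sketch of a memory-proportional layer split. It is not the logic in pipeline_generate.py, which is not shown in this issue. The function name `split_layers` is hypothetical, the layer count of 61 is an assumption about the model, and the sketch treats every layer as the same size, which is only an approximation.

```python
def split_layers(num_layers: int, ram_gb: list[int]) -> list[tuple[int, int]]:
    """Assign each rank a contiguous [start, end) layer range sized by its RAM share."""
    total = sum(ram_gb)
    exact = [num_layers * r / total for r in ram_gb]
    counts = [int(x) for x in exact]
    # Give any leftover layers to the ranks with the largest fractional remainder.
    leftover = num_layers - sum(counts)
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    # Turn per-rank counts into contiguous ranges.
    ranges, start = [], 0
    for c in counts:
        ranges.append((start, start + c))
        start += c
    return ranges


# 61 layers is an assumed value; RAM sizes are those of machine1..machine3.
print(split_layers(61, [192, 128, 64]))
# -> [(0, 31), (31, 51), (51, 61)]
```

With a split like this, machine 3 would hold roughly one sixth of the layers rather than a share comparable to the other two. The warnings in the log below (100534 MB against a 48000 MB limit, 116168 MB against a 110000 MB limit) suggest the current split does not scale with each machine's memory.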

Additional context
MPI log:

(base) user@machine1 ~ % mpirun -np 3 --hostfile hosts.txt /opt/homebrew/anaconda3/bin/python3 /Users/user/deepseek/pipeline_generate.py --prompt "Hello world"
Fetching 75 files: 100%|██████████| 75/75 [00:00<00:00, 27562.67it/s]
Fetching 75 files: 100%|██████████| 75/75 [00:00<00:00, 41342.20it/s]
Fetching 75 files: 100%|██████████| 75/75 [00:00<00:00, 34166.70it/s]
[WARNING] Generating with a model that requires 100534 MB which is close to the maximum recommended size of 48000 MB. This can be slow. See the documentation for possible work-arounds: https://github.com/ml-explore/mlx-examples/tree/main/llms#large-models
[WARNING] Generating with a model that requires 116168 MB which is close to the maximum recommended size of 110000 MB. This can be slow. See the documentation for possible work-arounds: https://github.com/ml-explore/mlx-examples/tree/main/llms#large-models
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
[machine1:42301] *** Process received signal ***
[machine1:42301] Signal: Abort trap: 6 (6)
[machine1:42301] Signal code:  (0)
[machine1:42301] [ 0] 0   libsystem_platform.dylib            0x0000000198072e04 _sigtramp + 56
[machine1:42301] [ 1] 0   libsystem_pthread.dylib             0x000000019803bf70 pthread_kill + 288
[machine1:42301] [ 2] 0   libsystem_c.dylib                   0x0000000197f48908 abort + 128
[machine1:42301] [ 3] 0   libc++abi.dylib                     0x0000000197ff244c _ZN10__cxxabiv130__aligned_malloc_with_fallbackEm + 0
[machine1:42301] [ 4] 0   libc++abi.dylib                     0x0000000197fe0a24 _ZL28demangling_terminate_handlerv + 320
[machine1:42301] [ 5] 0   libobjc.A.dylib                     0x0000000197c893f4 _ZL15_objc_terminatev + 172
[machine1:42301] [ 6] 0   libc++abi.dylib                     0x0000000197ff1710 _ZSt11__terminatePFvvE + 16
[machine1:42301] [ 7] 0   libc++abi.dylib                     0x0000000197ff16b4 _ZSt9terminatev + 108
[machine1:42301] [ 8] 0   libdispatch.dylib                   0x0000000197e89688 _dispatch_client_callout4 + 40
[machine1:42301] [ 9] 0   libdispatch.dylib                   0x0000000197ea5c88 _dispatch_mach_msg_invoke + 464
[machine1:42301] [10] 0   libdispatch.dylib                   0x0000000197e90a38 _dispatch_lane_serial_drain + 352
[machine1:42301] [11] 0   libdispatch.dylib                   0x0000000197ea69dc _dispatch_mach_invoke + 456
[machine1:42301] [12] 0   libdispatch.dylib                   0x0000000197e90a38 _dispatch_lane_serial_drain + 352
[machine1:42301] [13] 0   libdispatch.dylib                   0x0000000197e91764 _dispatch_lane_invoke + 432
[machine1:42301] [14] 0   libdispatch.dylib                   0x0000000197e90a38 _dispatch_lane_serial_drain + 352
[machine1:42301] [15] 0   libdispatch.dylib                   0x0000000197e91730 _dispatch_lane_invoke + 380
[machine1:42301] [16] 0   libdispatch.dylib                   0x0000000197e9c9a0 _dispatch_root_queue_drain_deferred_wlh + 288
[machine1:42301] [17] 0   libdispatch.dylib                   0x0000000197e9c1ec _dispatch_workloop_worker_thread + 540
[machine1:42301] [18] 0   libsystem_pthread.dylib             0x00000001980383d8 _pthread_wqthread + 288
[machine1:42301] [19] 0   libsystem_pthread.dylib             0x00000001980370f0 start_wqthread + 8
[machine1:42301] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
/opt/homebrew/anaconda3/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node machine1 exited on signal 6 (Abort trap: 6).
--------------------------------------------------------------------------
