Describe the bug
Running distributed inference of DeepSeek-R1-3bit on three M2 Ultra machines fails.
Desktop:
- OS Version: macOS Sequoia 15.2
- mpirun (Open MPI) 4.1.1, installed via conda
- Python 3.12.7 (packaged by Anaconda, Inc.)
- MLX version: 0.22.0, built from source at e6a7ab9
- machine1: 192 GB M2 Ultra
- machine2: 128 GB M2 Ultra
- machine3: 64 GB M2 Ultra
- Connectivity: 1 Gb Ethernet switch connected to the Ethernet 1 port on each device.
To Reproduce
- Follow the instructions in this gist, including setting the GPU limit on each machine to ~80% of its capacity (see the sketch after the hostfile below).
- Launch inference as follows:
mpirun -np 3 --hostfile hosts.txt /path/to/anaconda/python3 /path/to/pipeline_generate.py --prompt "Hello world"
hosts.txt:
machine1 slots=1
machine2 slots=1
machine3 slots=1
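The GPU limit step corresponds to the macOS `iogpu.wired_limit_mb` sysctl referenced in the MLX large-model documentation; a minimal sketch at roughly 80% of each machine's RAM (illustrative values, not necessarily the gist's exact numbers) would be:

```shell
# Illustrative values only: roughly 80% of each machine's RAM, in MB.
# Run on the corresponding machine; the setting resets on reboot.
sudo sysctl iogpu.wired_limit_mb=153600   # machine1 (192 GB)
sudo sysctl iogpu.wired_limit_mb=102400   # machine2 (128 GB)
sudo sysctl iogpu.wired_limit_mb=51200    # machine3 (64 GB)
```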
Actual behavior
- Machines 1 and 2 (192 GB and 128 GB of memory) load about 105 GB of weights without using any swap and have roughly 90 GB and 25 GB remaining, respectively.
- Machine 3 (64 GB of memory) hits 58.5 GB of RAM utilization and does not stop there.
- Machine 1 then shows the following error:
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
- prterun then exits with an error on Machine 1.
- Machines 2 and 3 hold ~100 GB in RAM and/or swap for another 30 seconds before exiting.
Expected behavior
Each machine loads weights into about 80-90% of its memory without OOMing, and inference eventually runs and produces tokens.
Additional context
MPI log:
(base) user@machine1 ~ % mpirun -np 3 --hostfile hosts.txt /opt/homebrew/anaconda3/bin/python3 /Users/user/deepseek/pipeline_generate.py --prompt "Hello world"
Fetching 75 files: 100%|██████████| 75/75 [00:00<00:00, 27562.67it/s]
Fetching 75 files: 100%|██████████| 75/75 [00:00<00:00, 41342.20it/s]
Fetching 75 files: 100%|██████████| 75/75 [00:00<00:00, 34166.70it/s]
[WARNING] Generating with a model that requires 100534 MB which is close to the maximum recommended size of 48000 MB. This can be slow. See the documentation for possible work-arounds: https://github.com/ml-explore/mlx-examples/tree/main/llms#large-models
[WARNING] Generating with a model that requires 116168 MB which is close to the maximum recommended size of 110000 MB. This can be slow. See the documentation for possible work-arounds: https://github.com/ml-explore/mlx-examples/tree/main/llms#large-models
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
[machine1:42301] *** Process received signal ***
[machine1:42301] Signal: Abort trap: 6 (6)
[machine1:42301] Signal code: (0)
[machine1:42301] [ 0] 0 libsystem_platform.dylib 0x0000000198072e04 _sigtramp + 56
[machine1:42301] [ 1] 0 libsystem_pthread.dylib 0x000000019803bf70 pthread_kill + 288
[machine1:42301] [ 2] 0 libsystem_c.dylib 0x0000000197f48908 abort + 128
[machine1:42301] [ 3] 0 libc++abi.dylib 0x0000000197ff244c _ZN10__cxxabiv130__aligned_malloc_with_fallbackEm + 0
[machine1:42301] [ 4] 0 libc++abi.dylib 0x0000000197fe0a24 _ZL28demangling_terminate_handlerv + 320
[machine1:42301] [ 5] 0 libobjc.A.dylib 0x0000000197c893f4 _ZL15_objc_terminatev + 172
[machine1:42301] [ 6] 0 libc++abi.dylib 0x0000000197ff1710 _ZSt11__terminatePFvvE + 16
[machine1:42301] [ 7] 0 libc++abi.dylib 0x0000000197ff16b4 _ZSt9terminatev + 108
[machine1:42301] [ 8] 0 libdispatch.dylib 0x0000000197e89688 _dispatch_client_callout4 + 40
[machine1:42301] [ 9] 0 libdispatch.dylib 0x0000000197ea5c88 _dispatch_mach_msg_invoke + 464
[machine1:42301] [10] 0 libdispatch.dylib 0x0000000197e90a38 _dispatch_lane_serial_drain + 352
[machine1:42301] [11] 0 libdispatch.dylib 0x0000000197ea69dc _dispatch_mach_invoke + 456
[machine1:42301] [12] 0 libdispatch.dylib 0x0000000197e90a38 _dispatch_lane_serial_drain + 352
[machine1:42301] [13] 0 libdispatch.dylib 0x0000000197e91764 _dispatch_lane_invoke + 432
[machine1:42301] [14] 0 libdispatch.dylib 0x0000000197e90a38 _dispatch_lane_serial_drain + 352
[machine1:42301] [15] 0 libdispatch.dylib 0x0000000197e91730 _dispatch_lane_invoke + 380
[machine1:42301] [16] 0 libdispatch.dylib 0x0000000197e9c9a0 _dispatch_root_queue_drain_deferred_wlh + 288
[machine1:42301] [17] 0 libdispatch.dylib 0x0000000197e9c1ec _dispatch_workloop_worker_thread + 540
[machine1:42301] [18] 0 libsystem_pthread.dylib 0x00000001980383d8 _pthread_wqthread + 288
[machine1:42301] [19] 0 libsystem_pthread.dylib 0x00000001980370f0 start_wqthread + 8
[machine1:42301] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
/opt/homebrew/anaconda3/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node machine1 exited on signal 6 (Abort trap: 6).
--------------------------------------------------------------------------
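For completeness, here is a small sanity check (a sketch, assuming passwordless SSH to the three hosts and that the `iogpu.wired_limit_mb` sysctl is readable on this macOS version) to confirm the wired limit actually took effect on every machine before launching mpirun:

```shell
# Sketch: print total RAM and the current GPU wired limit on each host
# before launching mpirun. Assumes passwordless SSH to machine1-3 and that
# iogpu.wired_limit_mb is readable on macOS 15.
for host in machine1 machine2 machine3; do
  echo "== $host =="
  ssh "$host" sysctl hw.memsize iogpu.wired_limit_mb
done
```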