
Interesting work! Few questions #20

Open
QingHe-He opened this issue Feb 21, 2025 · 8 comments

Comments

@QingHe-He

Hi,

Intuitively, PCIe bandwidth is much lower than HBM bandwidth. It seems difficult for the checkpoint to converge unless you severely slow down the XPU when using CoW or the soft-dirty bit, which may largely reduce the value of this work?

Is it possible that in most cases where the memory wall is the bottleneck, this work offers little or even no improvement over a traditional stop-the-world checkpoint? Did I get it wrong?

Thanks for the reply.

@wxdwfc

wxdwfc commented Feb 21, 2025

Thank you for your interest and questions. I disagree with the premise that the memory wall is the bottleneck "in most cases" and would appreciate it if you could provide evidence to support this claim. To clarify: concurrent checkpointing provides a non-trivial speedup as long as the copy operation does not take substantially longer than the computation.

The reasoning behind your claim is that "PCIe bandwidth is much lower than HBM bandwidth." While this is true, it doesn't necessarily mean that computation will dirty buffers that need to be isolated (or re-copied) faster than PCIe can transfer them. For example, during training, most time is spent on computation and communication (allowing us to do the concurrent copy), with only a short write-back phase. During inference, most of the HBM buffer remains unchanged, eliminating the need for isolation or re-copying. Please check our current preprint for the evaluations (in which we don't slow down XPU execution); we will soon release a more detailed technical report with more supported workloads, such as multi-GPU training.
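To make the overlap point concrete, here is a minimal CUDA sketch (illustrative only, not the PhoenixOS implementation; the buffer names and sizes are made up). The checkpoint drain copies a buffer the kernel only reads, so it can proceed on a separate stream while the next step computes; a buffer the kernel writes is exactly what CoW or dirty-bit tracking would have to isolate or re-copy:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Stand-in for one training step: reads d_w, writes d_act.
__global__ void step(const float *d_w, float *d_act, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_act[i] = d_w[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 24;                       // 64 MiB of floats
    float *d_w, *d_act, *h_ckpt;
    cudaMalloc(&d_w, n * sizeof(float));
    cudaMalloc(&d_act, n * sizeof(float));
    cudaMallocHost(&h_ckpt, n * sizeof(float));  // pinned, required for async D2H

    cudaStream_t compute_s, copy_s;
    cudaStreamCreate(&compute_s);
    cudaStreamCreate(&copy_s);

    // The checkpoint drain over PCIe runs on copy_s while the step runs on
    // compute_s. Since the kernel only reads d_w, no isolation is needed;
    // checkpointing d_act concurrently would need CoW / dirty tracking.
    cudaMemcpyAsync(h_ckpt, d_w, n * sizeof(float),
                    cudaMemcpyDeviceToHost, copy_s);
    step<<<(n + 255) / 256, 256, 0, compute_s>>>(d_w, d_act, n);

    cudaDeviceSynchronize();
    printf("copy and compute overlapped\n");
    cudaFreeHost(h_ckpt);
    cudaFree(d_act);
    cudaFree(d_w);
    return 0;
}
```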

@QingHe-He
Author

Really appreciate your reply. Just like pre-copy VM migration, your contribution is a natural and interesting idea.

But when it comes to GPUs/NPUs, things become a lot more complicated because of CPU bypass. I just wanted to raise my concern, and I look forward to seeing more evaluations. Maybe you could test your work in such a scenario.

Btw, during inference migration, might there be a faster way to checkpoint? For example, pulling buffers directly from one GPU to another?

@wxdwfc

wxdwfc commented Feb 21, 2025

Thank you for your response.

"But when it comes to GPUs/NPUs": actually, our pre-copy and the other techniques work on NVIDIA GPUs like the A100 :). Please check our preprint for more information on how we did that. You can also try our code (we haven't released the migration-related code yet, but checkpoint and restore work fine), and we will release a more stable version soon.

"Btw, during inference migration, might there be a faster way to checkpoint? For example, pulling buffers directly from one GPU to another?"
Do you mean something like GPUDirect RDMA? If so, yes, we have adopted it for the migration case.
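For context: cross-node GPUDirect RDMA goes through the NIC via ibverbs, which is too much for a short sketch, but the in-node analogue of "pull the buffer directly from one GPU to another" can be illustrated with CUDA peer access (again, a sketch, not the project's migration path):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 256u << 20;   // 256 MiB of "inference state" (made up)
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) {
        printf("no peer access between GPU 0 and GPU 1\n");
        return 0;
    }

    float *src, *dst;
    cudaSetDevice(0);
    cudaMalloc(&src, bytes);
    cudaDeviceEnablePeerAccess(1, 0);  // allow GPU 0 <-> GPU 1 direct access
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);
    cudaDeviceEnablePeerAccess(0, 0);

    // Direct device-to-device copy over NVLink/PCIe, no host staging buffer.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaDeviceSynchronize();
    printf("pulled %zu bytes from GPU 0 to GPU 1 directly\n", bytes);

    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
    return 0;
}
```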

@QingHe-He
Author

Yes. I reconsidered this problem. We are actually discussing it from two different perspectives: one approach reduces downtime by minimizing data transfer, the other by increasing bandwidth. However, even RDMA end-to-end throughput is limited by PCIe bandwidth, so I'm wondering whether we could overcome this bottleneck, e.g., with something like multipath?

@wxdwfc

wxdwfc commented Feb 24, 2025

Yes. My point is that bandwidth represents a hard limit that is difficult to work around without additional assumptions (which could compromise the transparency of tools like CRIU).

Note that our recent work on fast model loading uses optimized RDMA multicast to overcome the single-machine PCIe limit (https://arxiv.org/pdf/2412.17246). Nevertheless, that solution only works for autoscaling and cannot be applied to cases like live migration.
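For a sense of scale, a quick back-of-envelope on the gap this thread keeps returning to (the rates below are assumed round numbers, not measurements):

```cuda
// Plain host code (compiles with nvcc); all rates are illustrative assumptions.
#include <cstdio>

int main() {
    const double state_gib = 80.0;    // e.g. full device memory of one A100
    const double hbm_gibs  = 1500.0;  // ballpark effective HBM2e bandwidth
    const double pcie_gibs = 25.0;    // realistic PCIe 4.0 x16 effective rate
    const double nic_gibs  = 23.0;    // ~200 Gbps NIC, itself behind that PCIe

    printf("drain at HBM rate : %6.2f s\n", state_gib / hbm_gibs);
    printf("drain over PCIe   : %6.2f s\n", state_gib / pcie_gibs);
    printf("drain over one NIC: %6.2f s\n", state_gib / nic_gibs);
    // Adding NICs (multipath) widens the network lane, but one GPU's traffic
    // still funnels through its own PCIe link, hence the "hard limit" above.
    return 0;
}
```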

@QingHe-He
Author

Thanks for your reply. Btw, from reading your paper, the whole contribution is based on API remoting. Does PhoenixOS support features like CUDA Graphs?

@wxdwfc

wxdwfc commented Mar 3, 2025

Thanks for your interest. I think it's misleading to say the whole contribution is based on remoting: only the context pool relies on remoting, while the other designs don't. Remoting is just an implementation choice.

Regarding CUDA Graphs, our current code doesn't support them, but we plan to add support in the future (though it's not our highest priority at the moment).

@913887524gsd

913887524gsd commented Mar 3, 2025

At least I have not seen any remoting framework that claims to support CUDA Graphs. The CUDA Graph APIs are hard to remote; the main difficulty lies in stream capture: CUDA lets user code issue kernel/memory ops through the normal launch path while they are being recorded into a graph instead of executed. This makes the resulting graph object non-transparent to the remoting framework (the framework could potentially emulate this feature... I don't know lol).
You can see this paper released several days ago: http://arxiv.org/abs/2502.16631. It points out that remoting CUDA Graphs is a challenge.
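A minimal sketch of the stream-capture behavior described above (standard CUDA runtime API, no remoting framework assumed): the same <<<...>>> launch syntax either executes or records depending on the stream's capture state, which is what a forwarding layer would have to mirror:

```cuda
#include <cuda_runtime.h>

__global__ void step(float *x) { x[threadIdx.x] += 1.0f; }

int main() {
    float *d;
    cudaMalloc(&d, 256 * sizeof(float));
    cudaStream_t s;
    cudaStreamCreate(&s);

    cudaGraph_t graph;
    cudaGraphExec_t exec;

    // Between Begin/EndCapture, this launch is *recorded*, not executed.
    // A remoting layer that forwards it verbatim cannot tell "record"
    // apart from "run" without also mirroring the capture state.
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    step<<<1, 256, 0, s>>>(d);
    cudaStreamEndCapture(s, &graph);

    cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12 signature
    cudaGraphLaunch(exec, s);               // the recorded work runs here
    cudaStreamSynchronize(s);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaFree(d);
    return 0;
}
```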
