Interesting work! A few questions #20
Hi,

Intuitively, PCIe bandwidth is much lower than HBM bandwidth, so it seems difficult for pre-copy to converge unless you severely slow down the XPU when using CoW or soft-dirty-bit tracking, which may largely undermine the value of this work.

Is it possible that in most cases where the memory wall is the bottleneck, this work gives little or even no improvement over a traditional stop-the-world checkpoint? Did I get it wrong?

Thanks for your reply.
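(A rough back-of-envelope behind this concern, using assumed A100-class numbers of roughly 2 TB/s HBM bandwidth and roughly 25 GB/s effective PCIe 4.0 x16, for a hypothetical 40 GB device snapshot:

$$t_{\mathrm{PCIe}} \approx \frac{40\ \mathrm{GB}}{25\ \mathrm{GB/s}} = 1.6\ \mathrm{s}, \qquad t_{\mathrm{HBM}} \approx \frac{40\ \mathrm{GB}}{2000\ \mathrm{GB/s}} = 20\ \mathrm{ms}$$

In the worst case the XPU can rewrite memory about 80x faster than PCIe can drain it, so whether pre-copy converges hinges on how much memory is actually dirtied during the copy, which is exactly what the reply below addresses.)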
Thank you for your interest and questions. I disagree with the claim that this work helps little "in most cases where the memory wall is the bottleneck," and would appreciate evidence to support it. To clarify: concurrent checkpointing provides non-trivial speedup as long as the copy does not take substantially longer than the computation. The reasoning behind your claim is that PCIe bandwidth is much lower than HBM bandwidth. While true, that doesn't necessarily mean the computation generates buffers that need to be isolated (or re-copied) faster than PCIe can transfer them. For example, during training, most time is spent on computation and communication (which lets us do the concurrent copy), with only a short write-back phase. During inference, most of the HBM buffer remains unchanged, eliminating the need for isolation or re-copying. Please check our current preprint for the evaluations (which show we do not slow down XPU execution); we will soon release a more detailed technical report with more supported workloads, such as multi-GPU training.
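A minimal sketch of the overlap being described, assuming a made-up `train_step` kernel (illustrative CUDA, not PhoenixOS's actual code): the checkpoint drains to pinned host memory on one stream while compute keeps running on another, so the copy itself adds little wall-clock time; only buffers the computation dirties meanwhile would need CoW isolation or re-copying.

```cuda
// Illustrative sketch: overlap a device-to-host checkpoint copy with
// ongoing computation using two CUDA streams. Kernel and sizes are
// made up; a real pre-copy scheme must additionally track pages the
// kernel dirties during the copy and re-copy (or CoW-isolate) them.
#include <cuda_runtime.h>

__global__ void train_step(float* w, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) w[i] += 1e-3f;              // stand-in for real work
}

int main() {
    const size_t n = 1 << 26;              // 64M floats, ~256 MB
    float *d_w, *h_ckpt;
    cudaMalloc(&d_w, n * sizeof(float));
    cudaMallocHost(&h_ckpt, n * sizeof(float));  // pinned => async DMA

    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);

    // The PCIe transfer and the compute kernel run concurrently.
    cudaMemcpyAsync(h_ckpt, d_w, n * sizeof(float),
                    cudaMemcpyDeviceToHost, copy);
    train_step<<<(unsigned)((n + 255) / 256), 256, 0, compute>>>(d_w, n);

    cudaStreamSynchronize(copy);           // checkpoint copy finished
    cudaStreamSynchronize(compute);        // computation finished

    cudaFree(d_w);
    cudaFreeHost(h_ckpt);
    return 0;
}
```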
Really appreciate your reply. Like pre-copy VM migration, your contribution is a natural and interesting idea. But when it comes to GPUs/NPUs, things become a lot more complicated because the CPU is bypassed. I just wanted to raise my concern, and I'd like to see more evaluations; maybe you could test your work in such a scenario. By the way, during inference migration, might there be a faster way to checkpoint? For example, pulling buffers directly from one GPU to another?
Thank you for your response.

"But when it comes to GPUs/NPUs": actually, our pre-copy and other mechanisms work on NVIDIA GPUs like the A100 :). You can check our preprint for more information on how we have done that. You can also try our code (currently we don't release the migration-related code, but checkpoint and restore work fine), and we will release a more stable version soon.

"By the way, during inference migration, might there be a faster way to checkpoint? For example, pulling buffers directly from one GPU to another?"
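For concreteness, a direct pull within one node could look like the sketch below (a hypothetical helper, assuming peer access between the two GPUs; across machines the equivalent path is RDMA/GPUDirect, which is where the PCIe limit discussed next comes in):

```cuda
// Minimal sketch of pulling a buffer directly between two GPUs in the
// same machine via CUDA peer-to-peer; names are illustrative and not
// from the project under discussion.
#include <cuda_runtime.h>

void p2p_pull(int dst_dev, float* d_dst,
              int src_dev, const float* d_src,
              size_t bytes, cudaStream_t stream) {
    cudaSetDevice(dst_dev);
    cudaDeviceEnablePeerAccess(src_dev, 0);  // once per device pair
    // The copy travels over NVLink/PCIe without staging through host RAM.
    cudaMemcpyPeerAsync(d_dst, dst_dev, d_src, src_dev, bytes, stream);
}
```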
Yes, I reconsidered this problem. We are actually discussing it from two different perspectives: one approach reduces downtime by minimizing data transfer; the other is about increasing bandwidth. However, even RDMA end-to-end throughput is limited by PCIe bandwidth, so I'm wondering whether we could overcome this bottleneck, e.g., with something like multipath?
Yes. My point is that bandwidth is a hard limit that is difficult to overcome without additional assumptions (which could compromise the transparency of tools like CRIU). Note that our recent work on fast model loading uses optimized RDMA multicast to overcome the single-machine PCIe limit (https://arxiv.org/pdf/2412.17246). Nevertheless, that solution only works for autoscaling and cannot be applied to cases like live migration.
Thanks for your reply. By the way, from reading your paper, the whole contribution seems to be based on API remoting. Does PhoenixOS support features like CUDA Graphs?
Thanks for your interest. I think it's misleading to say the whole contribution is based on remoting: only the context pool relies on remoting, while the other designs don't. Remoting is just an implementation choice. Regarding CUDA Graphs, our current code doesn't support them, but we plan to add support in the future (though it's not our highest priority at the moment).
At least I have not seen any remoting framework that claims to support CUDA Graphs. CUDA Graph APIs are hard to remote; the main difficulty lies in capture: CUDA lets the user issue kernel/memory ops through the normal stream APIs while they are being recorded into a graph. This makes the CUDA Graph object non-transparent to a remoting framework (the framework could potentially emulate this feature... I don't know, lol).
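A minimal sketch of the capture mechanism in question (plain CUDA runtime API, nothing project-specific): between begin/end capture, the usual launch calls are recorded into an opaque `cudaGraph_t` instead of executing, so a remoting layer that merely forwards individual API calls never observes the final graph topology unless it re-implements capture itself.

```cuda
// Sketch of CUDA Graph stream capture: launches between Begin/EndCapture
// are recorded, not executed, and land in an opaque cudaGraph_t that a
// call-forwarding remoting layer cannot inspect without extra machinery.
#include <cuda_runtime.h>

__global__ void step(float* x) { x[threadIdx.x] *= 2.0f; }

int main() {
    float* d_x;
    cudaMalloc(&d_x, 256 * sizeof(float));
    cudaMemset(d_x, 0, 256 * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    cudaGraph_t graph;
    cudaGraphExec_t exec;

    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    step<<<1, 256, 0, s>>>(d_x);   // recorded into the graph, not run
    step<<<1, 256, 0, s>>>(d_x);
    cudaStreamEndCapture(s, &graph);

    cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12 signature
    cudaGraphLaunch(exec, s);               // replay the whole graph
    cudaStreamSynchronize(s);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaFree(d_x);
    return 0;
}
```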