
Interesting work! Few questions #20

Open
QingHe-He opened this issue Feb 21, 2025 · 8 comments

Comments

@QingHe-He

Hi,

Intuitively, PCIe bandwidth is much lower than HBM bandwidth. It seems difficult for the checkpoint to converge unless you severely slow down the XPU when using CoW or the soft-dirty bit, which may largely reduce the value of this work?

Is it possible that in most cases where the memory wall is the bottleneck, this work offers little or even no improvement over a traditional stop-the-world checkpoint? Did I get it wrong?

Thanks for the reply.

@wxdwfc

wxdwfc commented Feb 21, 2025

Thank you for your interest and questions. I disagree with the premise that the memory wall is the bottleneck "in most cases" and would appreciate it if you could provide evidence to support this claim. To clarify: concurrent checkpointing provides a non-trivial speedup as long as the copy operation does not take substantially longer than the computation.

The reasoning behind your claim is that "PCIe bandwidth is much lower than HBM bandwidth." While this is true, it doesn't necessarily mean that computation will dirty buffers that need to be isolated (or re-copied) faster than PCIe can transfer them. For example, during training, most time is spent on computation and communication (allowing us to do the concurrent copy), with only a short write-back phase. During inference, most of the HBM buffer remains unchanged, eliminating the need for isolation or re-copying. Please check our current preprint for the evaluations (in which we don't slow down XPU execution); we will soon release a more detailed technical report with more supported workloads, such as multi-GPU training.
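To make the overlap point concrete, here is a minimal CUDA sketch (illustrative only, not the PhoenixOS implementation; the buffer names and sizes are made up). The checkpoint drain copies a buffer the kernel only reads, so it can proceed on a separate stream while the next step computes; a buffer the kernel writes is exactly what CoW or dirty-bit tracking would have to isolate or re-copy:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Stand-in for one training step: reads d_w, writes d_act.
__global__ void step(const float *d_w, float *d_act, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_act[i] = d_w[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 24;                       // 64 MiB of floats
    float *d_w, *d_act, *h_ckpt;
    cudaMalloc(&d_w, n * sizeof(float));
    cudaMalloc(&d_act, n * sizeof(float));
    cudaMallocHost(&h_ckpt, n * sizeof(float));  // pinned, required for async D2H

    cudaStream_t compute_s, copy_s;
    cudaStreamCreate(&compute_s);
    cudaStreamCreate(&copy_s);

    // The checkpoint drain over PCIe runs on copy_s while the step runs on
    // compute_s. Since the kernel only reads d_w, no isolation is needed;
    // checkpointing d_act concurrently would need CoW / dirty tracking.
    cudaMemcpyAsync(h_ckpt, d_w, n * sizeof(float),
                    cudaMemcpyDeviceToHost, copy_s);
    step<<<(n + 255) / 256, 256, 0, compute_s>>>(d_w, d_act, n);

    cudaDeviceSynchronize();
    printf("copy and compute overlapped\n");
    cudaFreeHost(h_ckpt);
    cudaFree(d_act);
    cudaFree(d_w);
    return 0;
}
```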

@QingHe-He
Author

Really appreciate your reply. Just like pre-copy VM migration, your contribution is a natural and interesting idea.

But when it comes to GPUs/NPUs, things become a lot more complicated because of CPU bypass. I just wanted to raise my concern, and I look forward to seeing more evaluations. Maybe you could test your work in such a scenario.

Btw, during inference migration, might there be a faster way to checkpoint? For example, pulling buffers directly from one GPU to another?

@wxdwfc

wxdwfc commented Feb 21, 2025

Thank you for your response.

"But when it comes to GPUs/NPUs": actually, our pre-copy and the other techniques work on NVIDIA GPUs like the A100 :). Please check our preprint for more information on how we did that. You can also try our code (we haven't released the migration-related code yet, but checkpoint and restore work fine), and we will release a more stable version soon.

"Btw, during inference migration, might there be a faster way to checkpoint? For example, pulling buffers directly from one GPU to another?"
Do you mean something like GPUDirect RDMA? If so, yes, we have adopted it for the migration case.
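For context: cross-node GPUDirect RDMA goes through the NIC via ibverbs, which is too much for a short sketch, but the in-node analogue of "pull the buffer directly from one GPU to another" can be illustrated with CUDA peer access (again, a sketch, not the project's migration path):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 256u << 20;   // 256 MiB of "inference state" (made up)
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) {
        printf("no peer access between GPU 0 and GPU 1\n");
        return 0;
    }

    float *src, *dst;
    cudaSetDevice(0);
    cudaMalloc(&src, bytes);
    cudaDeviceEnablePeerAccess(1, 0);  // allow GPU 0 <-> GPU 1 direct access
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);
    cudaDeviceEnablePeerAccess(0, 0);

    // Direct device-to-device copy over NVLink/PCIe, no host staging buffer.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaDeviceSynchronize();
    printf("pulled %zu bytes from GPU 0 to GPU 1 directly\n", bytes);

    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
    return 0;
}
```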

@QingHe-He
Author

Yes. I reconsidered this problem. We are actually discussing it from two different perspectives: one approach reduces downtime by minimizing data transfer, the other by increasing bandwidth. However, even RDMA end-to-end throughput is limited by PCIe bandwidth, so I'm wondering whether we could overcome this bottleneck, e.g., with something like multipath?

@wxdwfc

wxdwfc commented Feb 24, 2025

Yes. My point is that bandwidth represents a hard limit that is difficult to work around without additional assumptions (which could compromise the transparency of tools like CRIU).

Note that our recent work on fast model loading uses optimized RDMA multicast to overcome the single-machine PCIe limit (https://arxiv.org/pdf/2412.17246). Nevertheless, that solution only works for autoscaling and cannot be applied to cases like live migration.
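For a sense of scale, a quick back-of-envelope on the gap this thread keeps returning to (the rates below are assumed round numbers, not measurements):

```cuda
// Plain host code (compiles with nvcc); all rates are illustrative assumptions.
#include <cstdio>

int main() {
    const double state_gib = 80.0;    // e.g. full device memory of one A100
    const double hbm_gibs  = 1500.0;  // ballpark effective HBM2e bandwidth
    const double pcie_gibs = 25.0;    // realistic PCIe 4.0 x16 effective rate
    const double nic_gibs  = 23.0;    // ~200 Gbps NIC, itself behind that PCIe

    printf("drain at HBM rate : %6.2f s\n", state_gib / hbm_gibs);
    printf("drain over PCIe   : %6.2f s\n", state_gib / pcie_gibs);
    printf("drain over one NIC: %6.2f s\n", state_gib / nic_gibs);
    // Adding NICs (multipath) widens the network lane, but one GPU's traffic
    // still funnels through its own PCIe link, hence the "hard limit" above.
    return 0;
}
```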

@QingHe-He
Author

Thanks for your reply. Btw, from reading your paper, the whole contribution is based on API remoting. Does PhoenixOS support features like CUDA Graphs?

@wxdwfc

wxdwfc commented Mar 3, 2025

Thanks for your interest. I think it's misleading to say the whole contribution is based on remoting: only the context pool relies on remoting, while the other designs don't. Remoting is just an implementation choice.

Regarding CUDA Graphs, our current code doesn't support them, but we plan to add support in the future (though it's not our highest priority at the moment).

@913887524gsd

913887524gsd commented Mar 3, 2025

At least I have not seen any remoting framework that claims to support CUDA Graphs. The CUDA Graph APIs are hard to remote; the main difficulty lies in stream capture: CUDA lets user code issue kernel/memory ops through the normal launch path while they are being recorded into a graph instead of executed. This makes the resulting graph object non-transparent to the remoting framework (the framework could potentially emulate this feature... I don't know lol).
You can see this paper released several days ago: http://arxiv.org/abs/2502.16631. It points out that remoting CUDA Graphs is a challenge.
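A minimal sketch of the stream-capture behavior described above (standard CUDA runtime API, no remoting framework assumed): the same <<<...>>> launch syntax either executes or records depending on the stream's capture state, which is what a forwarding layer would have to mirror:

```cuda
#include <cuda_runtime.h>

__global__ void step(float *x) { x[threadIdx.x] += 1.0f; }

int main() {
    float *d;
    cudaMalloc(&d, 256 * sizeof(float));
    cudaStream_t s;
    cudaStreamCreate(&s);

    cudaGraph_t graph;
    cudaGraphExec_t exec;

    // Between Begin/EndCapture, this launch is *recorded*, not executed.
    // A remoting layer that forwards it verbatim cannot tell "record"
    // apart from "run" without also mirroring the capture state.
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    step<<<1, 256, 0, s>>>(d);
    cudaStreamEndCapture(s, &graph);

    cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12 signature
    cudaGraphLaunch(exec, s);               // the recorded work runs here
    cudaStreamSynchronize(s);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaFree(d);
    return 0;
}
```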
