[CLI]: GPUs hanging during distributed training caused by wandb.watch
#7423
Comments
Hi @nmd2k, thank you for reporting this and letting us know you have been experiencing the issue with log="all" only. Would you mind sharing some additional information to help us reproduce and troubleshoot the issue:
Hi @nmd2k, I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.
Hi, sorry for the late reply. Here is the relevant code:

    if accelerator.is_main_process and args.report_to == "wandb":
        wandb.watch(model, log="all", log_freq=args.logging_steps)
Same issue. The whole training process stalls for several minutes every fixed number of steps (the rank 0 GPU sits at 0% utilization). Everything returns to normal once I delete the wandb_watch["all"] setting.
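For reference, a minimal sketch of this workaround in an accelerate-based setup; the placeholder model, project name, and log_freq value below are assumptions rather than values from this thread:

```python
import torch
import wandb
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(8, 2)              # placeholder for the real model
model = accelerator.prepare(model)

if accelerator.is_main_process:
    wandb.init(project="watch-workaround")  # hypothetical project name
    # Log gradients only (wandb's default) instead of log="all", and log
    # less frequently; this skips the parameter histograms that users in
    # this thread associate with the periodic stall.
    wandb.watch(model, log="gradients", log_freq=1000)
```

Dropping the wandb.watch call entirely is the other option reported above.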
Describe the bug
I found that during distributed training, wandb.watch with the argument log="all" leads to the GPUs hanging: the rank 0 GPU is loaded but not working, and rank 1 makes no further progress (with log="all").

The integrated code:
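(A minimal sketch of this kind of integration, assuming an accelerate training loop; the model, data, and project name below are placeholders rather than the reporter's actual script.)

```python
import torch
import torch.nn as nn
import wandb
from accelerate import Accelerator

accelerator = Accelerator()

model = nn.Linear(10, 2)                        # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = torch.utils.data.TensorDataset(
    torch.randn(64, 10), torch.randint(0, 2, (64,))
)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

if accelerator.is_main_process:
    wandb.init(project="watch-hang-repro")      # hypothetical project name
    # log="all" registers hooks that log parameter *and* gradient histograms
    # every log_freq steps; this is the call reported to stall the other ranks.
    wandb.watch(model, log="all", log_freq=10)

for step, (x, y) in enumerate(dataloader):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    accelerator.backward(loss)
    optimizer.step()
```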
The problem is gone when the argument log="all" is removed. It seems like something is wrong with logging the model parameters (without log="all").

Additional Files
No response
Environment
wandb = 0.16.5
transformers = 4.39.0
pytorch = 2.2.1
accelerate = 0.29.3
Additional Context
No response