Timeout when downloading dataset metadata with 8 torchrun workers #2272
Labels
bug
Something isn't working
Comments
I managed to solve my problem by using
I guess yes, since increasing the timeout allows my run to start. Feel free to close the issue now that I have a working solution.
Thanks for sharing your solution @samsja! I'll close this issue then :)
Describe the bug
hey, I am experiencing a timeout when downloading a dataset. I would like to be able to increase this timeout, either through a longer default or via an environment variable.
Reproduction
I am using the following dataset
load_dataset("allenai/c4", "en", streaming=True)
in streaming mode and get the error below. This only happens when using torchrun with 8 workers; with 2 workers it works fine. My guess is that the workers fight for bandwidth, leading to the timeout when there are too many of them.
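If the bandwidth-contention guess is right, one possible mitigation (not from this thread, just a hedged sketch) is to stagger the workers so all eight ranks don't fetch dataset metadata at the same moment. torchrun exports a LOCAL_RANK environment variable for each process it spawns, which can be used to space out the first Hub request per rank:

```python
import os
import time

# Hypothetical mitigation sketch: delay each torchrun worker's first Hub
# request proportionally to its rank. LOCAL_RANK is set by torchrun; the
# 2-second spacing is an assumed value to tune for your connection.

def stagger_delay(spacing_seconds: float = 2.0) -> float:
    """Seconds this rank should wait before touching the Hub."""
    rank = int(os.environ.get("LOCAL_RANK", "0"))
    return rank * spacing_seconds

time.sleep(stagger_delay())
# ...then call load_dataset("allenai/c4", "en", streaming=True) as usual.
```

With 8 workers and 2-second spacing, rank 0 starts immediately and rank 7 waits 14 seconds, so metadata requests are spread out instead of arriving simultaneously.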
I actually "fixed" the issue locally by patching the timeout in this line:
huggingface_hub/src/huggingface_hub/hf_api.py
Line 2306 in 5ff2d15
I would like to increase this timeout in a more secure way.
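A less invasive alternative to patching hf_api.py, assuming a huggingface_hub release that reads timeout overrides from the environment (the HF_HUB_ETAG_TIMEOUT name is an assumption to verify against your installed version), is to set the variable before the library is imported:

```python
import os

# Assumed knob (check your huggingface_hub release): HF_HUB_ETAG_TIMEOUT
# (seconds) is read at import time to size metadata requests, so it must be
# set before huggingface_hub / datasets is imported.
os.environ["HF_HUB_ETAG_TIMEOUT"] = "60"  # raise from the small default

# from datasets import load_dataset
# ds = load_dataset("allenai/c4", "en", streaming=True)
```

Equivalently, the variable could be exported in the shell before launching, e.g. as a prefix to the torchrun command, so every worker inherits it.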
Thanks in advance 🙏
Logs
System info