I start Horovod with 8 GPU devices, where each device uses 1/8 of all the data. The job trains for 6 epochs. The first 5 epochs run OK, but when the last epoch of the chief is done, an error occurs before `saver.save`. The error message is:

Even though each device uses 1/8 of the data, I can't guarantee that all devices have the same number of batches, e.g. `1/8 * all_data_num / batch_size`. So while one device still needs an `allreduce`, some other devices may already have stopped.