Description
(referenced code: lines 598 to 613 at commit 133434b)
When training large models (especially 32B-parameter models) with distributed processing, `rl_global_batch` can end up as zero when `_world_size` is large, which later triggers a `ZeroDivisionError`. Is there a reasonable way to fix this?
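For illustration only, here is a minimal sketch of how the truncation could arise and one possible guard. It assumes `rl_global_batch` is derived from an upstream batch size by integer division over the world size; the helper name `compute_rl_global_batch`, the parameter `rollout_batch_size`, and the exact computation are assumptions, and the real code at the referenced lines may differ:

```python
def compute_rl_global_batch(rollout_batch_size: int, world_size: int) -> int:
    """Hypothetical helper mirroring the kind of computation at the
    referenced lines; the real derivation may differ."""
    # Integer division truncates to zero once world_size exceeds the
    # batch size, and any later `x // rl_global_batch` then raises
    # ZeroDivisionError.
    if rollout_batch_size < world_size:
        # One possible fix: fail fast with an actionable message instead
        # of letting a zero batch propagate into a later division.
        raise ValueError(
            f"rollout_batch_size ({rollout_batch_size}) must be >= "
            f"world_size ({world_size}); otherwise rl_global_batch "
            f"truncates to zero."
        )
    return rollout_batch_size // world_size


print(compute_rl_global_batch(256, 64))  # -> 4
# compute_rl_global_batch(32, 64) raises the descriptive ValueError
# up front instead of a ZeroDivisionError later on.
```

Clamping with `max(1, ...)` would also avoid the crash, but it silently changes the effective batch size per rank; failing fast with a clear message seems safer.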