When training large models (especially 32B-parameter models) with distributed processing, `rl_global_batch` can become zero if `_world_size` is large, which later causes a ZeroDivisionError. Is there a reasonable way to fix this?
Relevant code: OREAL/train_oreal.py, lines 598 to 613 (commit 133434b).
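A minimal sketch of one possible guard, assuming `rl_global_batch` is obtained by integer-dividing a configured total batch size by `_world_size` (the function name, argument names, and example values below are hypothetical and not from train_oreal.py):

```python
def compute_rl_global_batch(total_rl_batch_size: int, world_size: int) -> int:
    """Split a global RL batch across ranks, failing loudly instead of
    silently producing a zero per-rank batch that later divides by zero."""
    if world_size <= 0:
        raise ValueError(f"world_size must be positive, got {world_size}")
    per_rank = total_rl_batch_size // world_size
    if per_rank == 0:
        raise ValueError(
            f"total_rl_batch_size={total_rl_batch_size} is smaller than "
            f"world_size={world_size}; increase the batch size or launch "
            "fewer ranks so every rank receives at least one sample."
        )
    return per_rank


# Example: 64 prompts across 8 ranks -> 8 per rank;
# 64 prompts across 128 ranks raises instead of returning 0.
rl_global_batch = compute_rl_global_batch(total_rl_batch_size=64, world_size=8)
```

Raising early with a clear message is usually preferable to clamping the result to 1, since silently changing the effective batch size would alter training behavior without warning.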