Gradient synchronization in data-parallel trainers #12

Open
cgarciae opened this issue Feb 28, 2024 · 1 comment
cgarciae commented Feb 28, 2024

Hey, great job with nanodl!

I was just looking through the code and noticed that in Lambda's Trainer the gradients are not averaged across devices here:

loss, grads = jax.value_and_grad(loss_fn)(state.params)
state = state.apply_gradients(grads=grads)

Not sure if this is happening elsewhere, but to keep the weights in sync you usually apply jax.lax.pmean over the gradients before passing them to apply_gradients, e.g.

grads = jax.lax.pmean(grads, axis_name='devices')
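
For reference, here is a minimal sketch of how the fix would look inside a pmapped data-parallel train step. The axis name 'devices', the loss function, and the TrainState setup are illustrative assumptions rather than nanodl's actual API:

import jax
import optax
from flax.training import train_state

def train_step(state: train_state.TrainState, batch):
    # Hypothetical per-device loss; the trainer's real loss_fn would go here.
    def loss_fn(params):
        logits = state.apply_fn({'params': params}, batch['inputs'])
        return optax.softmax_cross_entropy_with_integer_labels(
            logits, batch['labels']
        ).mean()

    loss, grads = jax.value_and_grad(loss_fn)(state.params)
    # Average the gradients (and the loss, for logging) across devices so
    # every replica applies the same update and the weights stay in sync.
    grads = jax.lax.pmean(grads, axis_name='devices')
    loss = jax.lax.pmean(loss, axis_name='devices')
    state = state.apply_gradients(grads=grads)
    return state, loss

# The axis_name passed to pmap must match the one used in pmean above.
p_train_step = jax.pmap(train_step, axis_name='devices')

Without the pmean, each replica updates its parameters using only its local gradients, so the per-device copies of the weights drift apart after the first step.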
cgarciae changed the title from "Gradient synchronization in data-parallel traininers" to "Gradient synchronization in data-parallel trainers" Feb 28, 2024
HMUNACHI (Owner) commented Mar 2, 2024

Thanks for noticing this! It's often challenging to test these portions because I don't have a personal multi-GPU setup for development. However, I will have access to 2 GPUs around 10th March and will examine this immediately. You are more than welcome to make the correction from your end if convenient; I would in fact very much appreciate that.
