I tested Llama-2 13B on A800 GPUs, with pipeline parallelism (PP) = 4, micro-batch-size = 1, and global-batch-size = 64.

The 1F1B log (plain 1F1B, without virtual pipeline / VP):
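For reference, a launch-argument fragment in the Megatron-LM style that matches the setup described above. This is a sketch, not copied from the actual run; only the parallelism and batch-size flags shown are the ones relevant to the comparison:

```shell
# Hypothetical fragment of the training command line (Megatron-LM flag names);
# the rest of the arguments (model size, data paths, etc.) are omitted.
PARALLEL_ARGS="
    --pipeline-model-parallel-size 4 \
    --micro-batch-size 1 \
    --global-batch-size 64 \
"
```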
iteration 1/ 500000 | consumed samples: 64 | elapsed time per iteration (ms): 23376.6 | learning rate: 4.687E-08 | global batch size: 64 | lm loss: 1.123916E+01 | loss scale: 1.0 | grad norm: 121.332 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 2/ 500000 | consumed samples: 128 | elapsed time per iteration (ms): 15149.3 | learning rate: 9.375E-08 | global batch size: 64 | lm loss: 1.138808E+01 | loss scale: 1.0 | grad norm: 15.865 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 3/ 500000 | consumed samples: 192 | elapsed time per iteration (ms): 15153.5 | learning rate: 1.406E-07 | global batch size: 64 | lm loss: 1.138511E+01 | loss scale: 1.0 | grad norm: 15.744 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 4/ 500000 | consumed samples: 256 | elapsed time per iteration (ms): 15154.5 | learning rate: 1.875E-07 | global batch size: 64 | lm loss: 1.131369E+01 | loss scale: 1.0 | grad norm: 62.191 | number of skipped iterations: 0 | number of nan iterations: 0 |
The zero-v log:
iteration 1/ 500000 | consumed samples: 64 | elapsed time per iteration (ms): 23561.4 | learning rate: 4.687E-08 | global batch size: 64 | lm loss: 1.037349E+01 | loss scale: 1.0 | grad norm: 2.278 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 2/ 500000 | consumed samples: 128 | elapsed time per iteration (ms): 15432.7 | learning rate: 9.375E-08 | global batch size: 64 | lm loss: 1.037349E+01 | loss scale: 1.0 | grad norm: 0.453 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 3/ 500000 | consumed samples: 192 | elapsed time per iteration (ms): 16140.2 | learning rate: 1.406E-07 | global batch size: 64 | lm loss: 1.037348E+01 | loss scale: 1.0 | grad norm: 0.442 | number of skipped iterations: 0 | number of nan iterations: 0 |
iteration 4/ 500000 | consumed samples: 256 | elapsed time per iteration (ms): 16202.1 | learning rate: 1.875E-07 | global batch size: 64 | lm loss: 1.037344E+01 | loss scale: 1.0 | grad norm: 1.198 | number of skipped iterations: 0 | number of nan iterations: 0 |
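One detail worth noting in the logs above: the zero-v run's initial loss (1.037349E+01) is almost exactly ln(32000), the cross-entropy of a uniform prediction over Llama-2's 32k-token vocabulary, whereas the 1F1B run starts higher at ~11.24. A quick sanity check (assuming the standard Llama-2 vocabulary size of 32,000):

```python
import math

# Llama-2's tokenizer has a 32,000-token vocabulary (assumption for this check).
VOCAB_SIZE = 32000

# At random initialization, a model predicting a roughly uniform distribution
# over the vocabulary has expected cross-entropy loss ln(vocab_size).
expected_initial_loss = math.log(VOCAB_SIZE)
print(f"{expected_initial_loss:.6f}")  # ≈ 10.373491, matching the zero-v log
```

This suggests the difference between the two runs is already present at step 1, before any schedule-dependent gradient behavior could matter.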
For zero-v, I used this: