
Implement gradient clipping #286


Merged: 6 commits merged into main from yifeit/fix-convergence on Jun 7, 2025
Conversation

@tengyifei (Collaborator) commented Jun 6, 2025

Fixes #90

The Hugging Face trainer clips gradients by norm by default. It turns out this makes a big difference in training stability, and simply adding gradient clipping brings torchprime into parity with Hugging Face.
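For reference, a minimal sketch of the technique (illustrative only, not the actual torchprime change): clip the global gradient norm between the backward pass and the optimizer step, as the Hugging Face Trainer does by default.

```python
import torch

def train_step(model, batch, optimizer, max_grad_norm: float = 1.0):
    # Hypothetical training-step helper, for illustration only.
    loss = model(**batch).loss
    loss.backward()
    # Rescale all gradients so their combined L2 norm is at most max_grad_norm;
    # gradients already below the threshold are left untouched.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    optimizer.zero_grad()
    return loss
```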

Experiment: http://tb/share/yuBUPRF6KqZKPhwJkXNvc

[Screenshot: screenshot-2025-06-05-21-54-18]

Tested:

tp run --name tp-linear-clip-norm-2 torchprime/torch_xla_models/train.py model=llama-3-8b ici_mesh.fsdp=256 profile_step=3 profile_duration=30000 task.max_steps=1000 logging_steps=1 task.global_batch_size=256 dataset.hf_dataset_config_name=wikitext-103-raw-v1 run_name=tp-linear-clip-norm-2

tp run --use-hf torchprime/hf_models/train.py train_script.args.per_device_train_batch_size=256 +train_script.args.log_loss=true train_script.args.logging_strategy=steps +train_script.args.logging_steps=1 +train_script.args.logging_first_step=true +train_script.args.report_to=tensorboard train_script.args.max_steps=1000

@tengyifei tengyifei marked this pull request as ready for review June 6, 2025 04:53
@tengyifei tengyifei enabled auto-merge (squash) June 6, 2025 05:23
@yaoshiang (Collaborator) left a comment

Grad clipping by norm is not the only way to clip grads; clipping by absolute value is also valid. Can you update this PR to enable both? Also, the grad clipping should be configured outside the optimizer. Grad clipping is upstream of the optimizer, and the optimizer has no knowledge of it, so can you move the grad clipping config to a top-level section?
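A rough sketch of what this could look like (the config keys below are hypothetical, not torchprime's actual schema): a top-level grad clipping section that selects norm- or value-based clipping, applied between backward() and the optimizer step so the optimizer stays unaware of it.

```python
import torch

def apply_grad_clipping(model, grad_clip: dict) -> None:
    """Apply gradient clipping per a hypothetical top-level `grad_clip` config."""
    method = grad_clip.get("method")  # "norm", "value", or None
    if method == "norm":
        # Rescale gradients so their global L2 norm is at most max_norm.
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip["max_norm"])
    elif method == "value":
        # Clamp each gradient element to [-clip_value, clip_value].
        torch.nn.utils.clip_grad_value_(model.parameters(), grad_clip["clip_value"])
    # Called between loss.backward() and optimizer.step(); the optimizer itself
    # never sees the clipping configuration.
```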

@tengyifei tengyifei requested a review from yaoshiang June 6, 2025 23:26
@tengyifei (Collaborator, Author) commented

@yaoshiang done. ptal

@tengyifei tengyifei force-pushed the yifeit/fix-convergence branch from 86636b3 to 74c583c Compare June 6, 2025 23:27
@yaoshiang (Collaborator) left a comment

Looking good! I realize you are pushing hard to finish this off, but if you have cycles, please add a unit test of some kind. A numerical test would be ideal but might take a bit of time: you could use a single layer with fixed weights, make the loss function MAE so you know the exact grads, apply your clip, and check that you get what you expect. If you feed those instructions to an LLM it'll probably do a decent job. But consider it optional, since you have numerical proof against the HF curves that the norm version at least is working. Thanks!
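A sketch of the kind of numerical test being suggested (assuming standard PyTorch utilities; not tied to torchprime's test layout): a single linear layer with fixed weights and an MAE loss, so the gradients are known exactly before and after clipping.

```python
import math
import torch

def test_clip_grad_norm_on_single_linear_layer():
    # One linear layer with fixed weights so the MAE gradients are known exactly.
    model = torch.nn.Linear(2, 1, bias=False)
    with torch.no_grad():
        model.weight.copy_(torch.tensor([[3.0, 4.0]]))

    x = torch.tensor([[1.0, 1.0]])
    target = torch.tensor([[0.0]])

    loss = torch.nn.functional.l1_loss(model(x), target)  # |3 + 4 - 0| = 7
    loss.backward()

    # dL/dW = sign(output - target) * x = [1, 1], so the pre-clip norm is sqrt(2).
    torch.testing.assert_close(model.weight.grad, torch.tensor([[1.0, 1.0]]))

    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # Clipping rescales the gradient to (approximately) unit norm.
    expected = torch.full((1, 2), 1.0 / math.sqrt(2))
    torch.testing.assert_close(model.weight.grad, expected)
```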

@tengyifei tengyifei disabled auto-merge June 7, 2025 00:00
@tengyifei tengyifei enabled auto-merge (squash) June 7, 2025 08:22
@tengyifei tengyifei merged commit 3a2ce6f into main Jun 7, 2025
13 checks passed
@tengyifei tengyifei deleted the yifeit/fix-convergence branch June 7, 2025 08:52
Closes: [torch_xla] MVP correctness and convergence check for Llama 3.0 8B (#90)