
Implement gradient clipping #286


Merged: 6 commits merged into main from yifeit/fix-convergence on Jun 7, 2025
Conversation

@tengyifei (Collaborator) commented Jun 6, 2025

Fixes #90

The Hugging Face trainer clips gradients by norm by default. It turns out this makes a big difference in training stability, and simply adding gradient clipping brings torchprime into parity with Hugging Face.
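For reference, a minimal sketch of the technique (illustrative only, not the actual torchprime change): clip the global gradient norm between the backward pass and the optimizer step, as the Hugging Face Trainer does by default.

```python
import torch

def train_step(model, batch, optimizer, max_grad_norm: float = 1.0):
    # Hypothetical training-step helper, for illustration only.
    loss = model(**batch).loss
    loss.backward()
    # Rescale all gradients so their combined L2 norm is at most max_grad_norm;
    # gradients already below the threshold are left untouched.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    optimizer.zero_grad()
    return loss
```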

Experiment: http://tb/share/yuBUPRF6KqZKPhwJkXNvc

[Screenshot: screenshot-2025-06-05-21-54-18]

Tested:

tp run --name tp-linear-clip-norm-2 torchprime/torch_xla_models/train.py model=llama-3-8b ici_mesh.fsdp=256 profile_step=3 profile_duration=30000 task.max_steps=1000 logging_steps=1 task.global_batch_size=256 dataset.hf_dataset_config_name=wikitext-103-raw-v1 run_name=tp-linear-clip-norm-2

tp run --use-hf torchprime/hf_models/train.py train_script.args.per_device_train_batch_size=256 +train_script.args.log_loss=true train_script.args.logging_strategy=steps +train_script.args.logging_steps=1 +train_script.args.logging_first_step=true +train_script.args.report_to=tensorboard train_script.args.max_steps=1000

@tengyifei tengyifei marked this pull request as ready for review June 6, 2025 04:53
@tengyifei tengyifei enabled auto-merge (squash) June 6, 2025 05:23
@yaoshiang (Collaborator) left a comment

Grad clipping by norm is not the only way to clip grads; clipping by absolute value is also valid. Can you update this PR to enable both? Also, the grad clipping should be configured outside the optimizer. Grad clipping is upstream of the optimizer, and the optimizer has no knowledge of it, so can you move the grad clipping config to a top-level section?
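A rough sketch of what this could look like (the config keys below are hypothetical, not torchprime's actual schema): a top-level grad clipping section that selects norm- or value-based clipping, applied between backward() and the optimizer step so the optimizer stays unaware of it.

```python
import torch

def apply_grad_clipping(model, grad_clip: dict) -> None:
    """Apply gradient clipping per a hypothetical top-level `grad_clip` config."""
    method = grad_clip.get("method")  # "norm", "value", or None
    if method == "norm":
        # Rescale gradients so their global L2 norm is at most max_norm.
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip["max_norm"])
    elif method == "value":
        # Clamp each gradient element to [-clip_value, clip_value].
        torch.nn.utils.clip_grad_value_(model.parameters(), grad_clip["clip_value"])
    # Called between loss.backward() and optimizer.step(); the optimizer itself
    # never sees the clipping configuration.
```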

@tengyifei tengyifei requested a review from yaoshiang June 6, 2025 23:26
@tengyifei (Collaborator, Author) commented

@yaoshiang done. ptal

@tengyifei tengyifei force-pushed the yifeit/fix-convergence branch from 86636b3 to 74c583c Compare June 6, 2025 23:27
@yaoshiang (Collaborator) left a comment

Looking good! I realize you are pushing hard to finish this off, but if you have cycles, please add a unit test of some kind. A numerical test would be ideal but might take a bit of time: you could use a single layer with fixed weights, make the loss function MAE so you know the exact grads, apply your clip, and check that you get what you expect. If you feed those instructions to an LLM it'll probably do a decent job. But consider it optional, since you have numerical proof against the HF curves that the norm version at least is working. Thanks!
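A sketch of the kind of numerical test being suggested (assuming standard PyTorch utilities; not tied to torchprime's test layout): a single linear layer with fixed weights and an MAE loss, so the gradients are known exactly before and after clipping.

```python
import math
import torch

def test_clip_grad_norm_on_single_linear_layer():
    # One linear layer with fixed weights so the MAE gradients are known exactly.
    model = torch.nn.Linear(2, 1, bias=False)
    with torch.no_grad():
        model.weight.copy_(torch.tensor([[3.0, 4.0]]))

    x = torch.tensor([[1.0, 1.0]])
    target = torch.tensor([[0.0]])

    loss = torch.nn.functional.l1_loss(model(x), target)  # |3 + 4 - 0| = 7
    loss.backward()

    # dL/dW = sign(output - target) * x = [1, 1], so the pre-clip norm is sqrt(2).
    torch.testing.assert_close(model.weight.grad, torch.tensor([[1.0, 1.0]]))

    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # Clipping rescales the gradient to (approximately) unit norm.
    expected = torch.full((1, 2), 1.0 / math.sqrt(2))
    torch.testing.assert_close(model.weight.grad, expected)
```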

@tengyifei tengyifei disabled auto-merge June 7, 2025 00:00
@tengyifei tengyifei enabled auto-merge (squash) June 7, 2025 08:22
@tengyifei tengyifei merged commit 3a2ce6f into main Jun 7, 2025
13 checks passed
@tengyifei tengyifei deleted the yifeit/fix-convergence branch June 7, 2025 08:52
Closes: [torch_xla] MVP correctness and convergence check for Llama 3.0 8B (#90)