Hi @Juo-kim,
Some critical comments first.
Not as critical, but just noting: I am curious whether it actually is an OOM -- it would be very helpful if you could show us the error message. My guess is that it is actually the use of the Triton Contracter that is causing the crashes, since it doesn't support the double backward required for the weight derivatives with respect to the force-loss contribution.
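For anyone reading along, here is a minimal PyTorch sketch of what that double backward looks like. The energy model and shapes below are toy placeholders, not the actual Allegro kernels; the point is only that a force loss requires differentiating through a first gradient:

```python
import torch

# Toy illustration of the "double backward" a force loss requires:
# forces are -dE/d(positions), so weight gradients of a force loss mean
# backpropagating through that first gradient.
torch.manual_seed(0)
pos = torch.randn(8, 3, requires_grad=True)        # placeholder atomic positions
weight = torch.randn(3, requires_grad=True)        # placeholder model parameter

energy = ((pos * weight).sum(dim=-1) ** 2).sum()   # placeholder "energy" model

# First backward: forces from the energy; create_graph=True keeps the graph
# so the force loss can itself be differentiated with respect to the weights.
(dE_dpos,) = torch.autograd.grad(energy, pos, create_graph=True)
forces = -dE_dpos

force_loss = forces.pow(2).mean()

# Second backward (the "double backward"): weight gradients of the force loss.
# A custom kernel that only implements a single backward cannot support this.
force_loss.backward()
print(weight.grad)
```

A kernel without double-backward support will typically fail at that last `backward()` call rather than with a memory error, which is why seeing the actual error message matters here.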
Hello,
Thank you very much for providing such an excellent codebase.
I would like to fine-tune the pre-trained NequIP/Allegro OAM model on my dataset, which contains 124,900 configurations.
However, when I attempt to fine-tune the pre-trained model, the training does not proceed due to GPU memory issues.
I am using an H200 GPU with 140 GB of memory, and even after reducing the batch size, the memory issue persists.
Since you must have trained the pre-trained model on an even larger dataset, I am curious how you addressed this type of memory problem during your training process.
Thank you very much.
This is the Slurm script that I tried:
Allegro_train.yaml
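In case it helps with the diagnosis, this is roughly how I am checking on my side that the failure is really a GPU out-of-memory error rather than some other crash. The workload below is just a placeholder, not my actual training step, and it assumes PyTorch >= 1.13 for `torch.cuda.OutOfMemoryError`:

```python
import torch

# Placeholder check: wrap the failing step and only treat the crash as an OOM
# if PyTorch raises its dedicated CUDA out-of-memory exception.
def run_step():
    x = torch.randn(2**15, 2**15, device="cuda")   # stand-in workload, adjust size
    return (x @ x).sum().item()

try:
    run_step()
except torch.cuda.OutOfMemoryError:
    print(torch.cuda.memory_summary())             # allocator breakdown at failure
    raise
```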