Fine tune the pre-trained model on large dataset #573

Juo-kim · 2025-11-18T08:24:10Z

Juo-kim
Nov 18, 2025

Hello,
Thank you very much for providing such an excellent codebase.

I would like to fine-tune the pre-trained Nequip/Allegro OAM model on my dataset, which contains 124,900 configurations.
However, when I attempt to fine-tune the pre-trained model, the training does not proceed due to GPU memory issues.

I am using an H200 GPU with 140 GB of memory, and even after reducing the batch size, the memory issue persists.

Since you must have trained the pre-trained model on an even larger dataset, I am curious how you addressed this type of memory problem during your training process.
Thank you very much.

This is the slurm script that I tried

#!/bin/bash               
#SBATCH -J Allegro_train
#SBATCH -p amd_h200nv_8
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH -o %x.o%j
#SBATCH -e %x.e%j
#SBATCH --time 48:00:00
#SBATCH --gres=gpu:1
#SBATCH --comment pytorch     # See Application SBATCH options name table's

export OMP_NUM_THREADS=1

module purge
module load cuda/12.1

source ~/anaconda3/etc/profile.d/conda.sh

srun nequip-train -cn Allegro_train.yaml ++trainer.num_nodes=2

Allegro_train.yaml

cw-tan · 2025-11-18T14:25:36Z

cw-tan
Nov 18, 2025
Maintainer

Hi @Juo-kim

Some critical comments first.

The Triton contracter does not work with training, see note in the Allegro docs: https://nequip.readthedocs.io/projects/allegro/en/latest/guide/triton.html
See API for ModelFromPackage: https://nequip.readthedocs.io/en/latest/api/save_model.html#nequip.model.ModelFromPackage. You can still set the compile_mode: compile argument for train-time compilation, which should fuse operations and potentially lead to better memory savings.
You are strongly advised to update the per-type atomic energy shifts of the model to match your fine-tuning dataset.

Not as critical but just noting:

Note that the cutoff radius should match what the model is trained on. In your config file, the cutoff radius will only affect the data processing and the neighborlist generation. If that neighborlist is larger than the model's cutoff radius hyperparameter, the model will still do compute on the extra neighbors even if they are all zeros because they are out of the model's cutoff radius.
If you are not using dataset statistics anywhere, you can avoid computing it by removing

  stats_manager:
    _target_: nequip.data.CommonDataStatisticsManager
    dataloader_kwargs:
      batch_size: 24
    type_names: ${model_type_names}

I am curious if it actually is an OOM -- it would be very helpful if you show us the error message. I would guess that it's actually the use of the Triton Contracter that is causing crashes since it doesn't support the double backward required for the weight derivatives with respect to the force loss contribution.

2 replies

Juo-kim Nov 18, 2025
Author

Thank you for your detailed answers!

About critical comments :

I used the Triton contractor because the YAML file requires a modifier to be defined, but for some reason OpenEquivariance wasn’t available, and I wanted to avoid using modify_PerTypeScaleShift. Since the configuration seemed to require a modifier, I added Triton contractor just to satisfy that condition. Is there a way to run the model without defining any modifier at all?
Thank you — I’ll try running it again with the compile_mode option enabled.
Regarding the per-type atomic energy shifts, I’ve heard that it’s best to compute the energies of isolated single atoms using the same DFT settings used for the dataset. But if I want to fine-tune the model later, does that mean I need to perform new DFT single-atom calculations every time the parameters change? If there’s a simpler or more practical way to set per-type atomic energy shifts, I’d really appreciate your guidance.

just noting :

If I set the cutoff radius lower than the model’s internal cutoff, does that cause any problems, or is it generally safe?
Understood — thank you.

So far I’ve been dealing with constant CUDA OOM issues, and although the training runs when I reduce both the cutoff radius and batch size, it becomes extremely slow. I will remove the Triton contractor and run the training again, then upload the results.

Thank you again for all your help!

cw-tan Nov 18, 2025
Maintainer

Is there a way to run the model without defining any modifier at all?
yes, just

training_module:
  ...
  model:
    _target_: nequip.model.ModelFromPackage
    package_path: /path/to/pretrained/model.nequip.zip

But if I want to fine-tune the model later, does that mean I need to perform new DFT single-atom calculations every time the parameters change?

If you want to fine-tune on a new dataset, and that dataset comes with different DFT settings (especially pseudopotentials), then the shifts are likely to be fundamentally different. The actual energy value you get from a DFT calculation is kind of arbitrary up to some shift -- it's energy differences that matter. The shifting is important to account for this.

If I set the cutoff radius lower than the model’s internal cutoff, does that cause any problems, or is it generally safe?

That will be inconsistent because our inference engines will always use the correct model cutoff radius. So not safe at all.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fine tune the pre-trained model on large dataset #573

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 1 comment 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Fine tune the pre-trained model on large dataset #573

Uh oh!

Uh oh!

Juo-kim Nov 18, 2025

Replies: 1 comment · 2 replies

Uh oh!

cw-tan Nov 18, 2025 Maintainer

Uh oh!

Juo-kim Nov 18, 2025 Author

Uh oh!

cw-tan Nov 18, 2025 Maintainer

Juo-kim
Nov 18, 2025

Replies: 1 comment 2 replies

cw-tan
Nov 18, 2025
Maintainer

Juo-kim Nov 18, 2025
Author

cw-tan Nov 18, 2025
Maintainer