
How to start multi-GPU training on a single machine #89

Open
kevinhuangxf opened this issue Jan 26, 2025 · 1 comment

@kevinhuangxf

Thanks for the excellent work!

I've run into a problem starting multi-GPU training. I have 8 GPUs, but each time I run the training command, only one GPU is used:

[Two screenshots attached showing only a single GPU in use]

I use this command:

python -m src.main +experiment=re10k data_loader.train.batch_size=14

Does this mean that even when training on a single node with multiple GPUs, I still need to use SLURM to launch multi-GPU training?
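
For reference, a quick way to check how many devices the training process can actually see (a minimal sketch, assuming a standard PyTorch environment; it is not part of this repository):

import torch

# If CUDA_VISIBLE_DEVICES (or a launcher) is restricting the process,
# this will report fewer than the 8 physical GPUs.
print(torch.cuda.is_available())      # True if any CUDA device is usable
print(torch.cuda.device_count())      # number of GPUs visible to this process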

@donydchen
Owner

Hi @kevinhuangxf, thanks for the kind words. Normally, the current setting should automatically use all available GPUs for training, so I'm not sure what is causing this issue. You could try explicitly specifying the training devices to use all GPUs by following the instructions here.
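
For instance, if the trainer is a standard PyTorch Lightning Trainer (an assumption; the exact config keys this repository exposes may differ), single-node training across all GPUs with DDP looks like this minimal sketch:

import pytorch_lightning as pl

# Minimal sketch, not the repository's actual code.
# "model" and "datamodule" are hypothetical placeholders.
trainer = pl.Trainer(
    accelerator="gpu",   # run on CUDA devices
    devices=-1,          # -1 = use every GPU visible to the process
    strategy="ddp",      # one process per GPU via DistributedDataParallel
)
# trainer.fit(model, datamodule=datamodule)

On the shell side, making all eight devices visible explicitly (e.g. CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 before the python command) can also rule out an environment variable silently restricting the process to one GPU.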
