Template Multinode GPU Error #566

Open
b-butler opened this issue Aug 20, 2021 · 1 comment
Labels: bug

Comments

@b-butler
Member

Description

For the Great Lakes template (and potentially others), multi-node GPU submissions incorrectly set --ntasks-per-node to the total number of tasks, disregarding the size of the individual nodes.

To Reproduce

Request a multi-node GPU submission with --pretend and view the output. Here is an example:

#SBATCH --job-name="TempProject/42b7b4f2921788ea14dac5566e6f06d0/foo/13ee8c7cb17a11b218fe41a3e31afab3"
#SBATCH --partition=gpu
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gpus=8

Given a request of nranks=4 and ngpu=8, this should be --ntasks-per-node=2, as there are 2 GPUs per node on the GPU partition of Great Lakes.
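
For reference, an operation along these lines generates a request like the one above (a sketch only: TempProject and foo are taken from the job name in the output, the directive values are the ones quoted in this report, and the decorator/directive spelling may differ across signac-flow versions):

# project.py -- sketch of an operation requesting multi-node GPU resources.
import flow
from flow import FlowProject

class TempProject(FlowProject):
    pass

@TempProject.operation
@flow.directives(nranks=4, ngpu=8)  # values quoted in this report
def foo(job):
    pass

if __name__ == "__main__":
    TempProject().main()

Submitting with python project.py submit --pretend then prints the generated sbatch script without queueing it.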

This problem may exist in other environments and was propagated to #561, so we should check other templates for this logical error.
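
For those checks, the correction I would expect is roughly the following (a sketch of the arithmetic only, not flow's actual implementation; ntasks=8 here assumes one task per GPU):

import math

def slurm_directives(ntasks, ngpu, gpus_per_node):
    """Sketch: derive per-node values instead of reusing the totals."""
    nodes = math.ceil(ngpu / gpus_per_node)      # nodes needed for the GPU request
    ntasks_per_node = math.ceil(ntasks / nodes)  # spread the tasks over those nodes
    return nodes, ntasks_per_node

# Great Lakes GPU nodes have 2 GPUs each (per the description above).
# Assuming one task per GPU (ntasks=8), this yields 4 nodes and
# --ntasks-per-node=2 rather than 8.
print(slurm_directives(ntasks=8, ngpu=8, gpus_per_node=2))  # (4, 2)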

b-butler changed the title from Potential Template Errors to Template Multinode GPU Error on Aug 20, 2021
b-butler added the bug label on Aug 20, 2021
@joaander
Member

While #722 produces a proper resource request for sbatch, it fails to work correctly. This request:

#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --gpus=4
...
# hard_disk_nvt_gpu(agg-88ab2b6305a8b249743764fc16e5c22e)
/home/joaander/miniforge3/bin/python /gpfs/accounts/sglotzer_root/sglotzer0/joaander/hoomd-validation/hoomd_validation/project.py run -o hard_disk_nvt_gpu -j agg-88ab2b6305a8b249743764fc16e5c22e
# Eligible to run:
# mpirun -n 4  /home/joaander/miniforge3/bin/python /gpfs/accounts/sglotzer_root/sglotzer0/joaander/hoomd-validation/hoomd_validation/project.py exec hard_disk_nvt_gpu agg-88ab2b6305a8b249743764fc16e5c22e

produces:

An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).

When I use this instead:

#SBATCH --ntasks=4
#SBATCH --gpus-per-task=1
#SBATCH --mem-per-cpu="30g"

mpirun is able to launch HOOMD, but somehow SLURM_LOCALID is 0 on all ranks. I will troubleshoot that further when testing the solution in #777. In the meantime, multi-node GPU jobs are still broken on Great Lakes.
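
For what it's worth, a quick way to confirm the SLURM_LOCALID behaviour independently of HOOMD (a sketch, assuming mpi4py is available in the job's environment):

# check_localid.py -- print what each MPI process sees.
# Run inside the job allocation, e.g.: mpirun -n 4 python check_localid.py
import os
import socket

from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
print(
    f"rank={rank} host={socket.gethostname()} "
    f"SLURM_LOCALID={os.environ.get('SLURM_LOCALID')}"
)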
