Description
I noticed that after PR #1692, when training a job that is already GPU-memory intensive, GPU 0 uses much more memory than before.
For example, for Hobot agent training with 2 envs per GPU and 5 eval envs, I checked GPU 0 memory usage after training actually starts.
With `torch.cuda.synchronize()` (20 GB total):

```
| 0 N/A N/A 1586336 C+G python 4539MiB |
| 0 N/A N/A 1586384 C /usr/bin/python 825MiB |
| 0 N/A N/A 1586385 C /usr/bin/python 825MiB |
| 0 N/A N/A 1586552 C /usr/bin/python 823MiB |
| 0 N/A N/A 1586553 C+G /usr/bin/python 941MiB |
| 0 N/A N/A 1586554 C /usr/bin/python 823MiB |
| 0 N/A N/A 1586555 C /usr/bin/python 823MiB |
| 0 N/A N/A 1587277 C+G /usr/bin/python 939MiB |
| 0 N/A N/A 1587279 C /usr/bin/python 823MiB |
| 0 N/A N/A 1587558 C /usr/bin/python 823MiB |
| 0 N/A N/A 1587765 C /usr/bin/python 823MiB |
| 0 N/A N/A 1588504 C+G /usr/bin/python 2385MiB |
| 0 N/A N/A 1589074 C+G /usr/bin/python 939MiB |
| 0 N/A N/A 1589550 C+G /usr/bin/python 941MiB |
| 0 N/A N/A 1589685 C+G /usr/bin/python 941MiB |
| 0 N/A N/A 1589839 C+G /usr/bin/python 941MiB |
| 0 N/A N/A 1589973 C+G /usr/bin/python 941MiB |
```
Without `torch.cuda.synchronize()` (10 GB total):

```
| 0 N/A N/A 1581927 C+G python 4539MiB |
| 0 N/A N/A 1581985 C /usr/bin/python 825MiB |
| 0 N/A N/A 1581986 C /usr/bin/python 825MiB |
| 0 N/A N/A 1581987 C /usr/bin/python 825MiB |
| 0 N/A N/A 1582144 G /usr/bin/python 118MiB |
| 0 N/A N/A 1582679 G /usr/bin/python 118MiB |
| 0 N/A N/A 1583963 C+G /usr/bin/python 2369MiB |
| 0 N/A N/A 1584240 G /usr/bin/python 116MiB |
| 0 N/A N/A 1584399 G /usr/bin/python 118MiB |
| 0 N/A N/A 1584532 G /usr/bin/python 118MiB |
| 0 N/A N/A 1584663 G /usr/bin/python 116MiB |
| 0 N/A N/A 1584795 G /usr/bin/python 118MiB |
```
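The extra ~820 MiB shown for each additional `C` process in the first listing appears consistent with an otherwise idle CUDA context on GPU 0: a bare `torch.cuda.synchronize()` lazily initializes CUDA on the current (default) device the first time it runs in a process. A minimal sketch to observe this in isolation, assuming a fresh process on a CUDA-capable machine (illustrative only, not taken from the report above):

```python
import torch

# In a fresh process that has not touched CUDA yet, a bare synchronize()
# lazily initializes a context on the default device (GPU 0).
assert not torch.cuda.is_initialized()

torch.cuda.synchronize()             # initializes CUDA on the current device
print(torch.cuda.is_initialized())   # True; nvidia-smi now lists this PID on GPU 0
```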
This causes serious trouble when we want to increase the number of envs per GPU, because a CUDA out-of-memory error is raised.
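One possible way to avoid paying this cost in processes that never use GPU 0 would be to guard or scope the call added in PR #1692. The sketch below is only an illustration under that assumption; `safe_synchronize` is a hypothetical helper, not an existing function in the codebase:

```python
import torch

def safe_synchronize(device=None):
    """Synchronize only if this process has already initialized CUDA,
    and only on the device it actually uses, so that an env/eval worker
    does not pull an extra CUDA context onto GPU 0."""
    if torch.cuda.is_available() and torch.cuda.is_initialized():
        torch.cuda.synchronize(device)
```

With a guard like this, workers that never touch CUDA should keep the ~118 MiB graphics-only footprint seen in the second listing instead of the ~820 MiB compute footprint.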