Abnormal GPU 0 memory consumption when calling torch.cuda.synchronize() while creating envs #1702

@hnyu

I noticed that after PR #1692, when training a job that is already GPU-memory intensive, GPU 0 uses much more memory than normal.

For example, for Hobot agent training with 2 envs per GPU and 5 eval envs, I checked GPU 0 memory usage after the training had actually started.

With torch.cuda.synchronize() (20G total)
| 0 N/A N/A 1586336 C+G python 4539MiB |
| 0 N/A N/A 1586384 C /usr/bin/python 825MiB |
| 0 N/A N/A 1586385 C /usr/bin/python 825MiB |
| 0 N/A N/A 1586552 C /usr/bin/python 823MiB |
| 0 N/A N/A 1586553 C+G /usr/bin/python 941MiB |
| 0 N/A N/A 1586554 C /usr/bin/python 823MiB |
| 0 N/A N/A 1586555 C /usr/bin/python 823MiB |
| 0 N/A N/A 1587277 C+G /usr/bin/python 939MiB |
| 0 N/A N/A 1587279 C /usr/bin/python 823MiB |
| 0 N/A N/A 1587558 C /usr/bin/python 823MiB |
| 0 N/A N/A 1587765 C /usr/bin/python 823MiB |
| 0 N/A N/A 1588504 C+G /usr/bin/python 2385MiB |
| 0 N/A N/A 1589074 C+G /usr/bin/python 939MiB |
| 0 N/A N/A 1589550 C+G /usr/bin/python 941MiB |
| 0 N/A N/A 1589685 C+G /usr/bin/python 941MiB |
| 0 N/A N/A 1589839 C+G /usr/bin/python 941MiB |
| 0 N/A N/A 1589973 C+G /usr/bin/python 941MiB |

Without torch.cuda.synchronize() (10G total)
| 0 N/A N/A 1581927 C+G python 4539MiB |
| 0 N/A N/A 1581985 C /usr/bin/python 825MiB |
| 0 N/A N/A 1581986 C /usr/bin/python 825MiB |
| 0 N/A N/A 1581987 C /usr/bin/python 825MiB |
| 0 N/A N/A 1582144 G /usr/bin/python 118MiB |
| 0 N/A N/A 1582679 G /usr/bin/python 118MiB |
| 0 N/A N/A 1583963 C+G /usr/bin/python 2369MiB |
| 0 N/A N/A 1584240 G /usr/bin/python 116MiB |
| 0 N/A N/A 1584399 G /usr/bin/python 118MiB |
| 0 N/A N/A 1584532 G /usr/bin/python 118MiB |
| 0 N/A N/A 1584663 G /usr/bin/python 116MiB |
| 0 N/A N/A 1584795 G /usr/bin/python 118MiB |
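
The totals quoted for the two listings above can be cross-checked by summing the per-process memory column. The snippet below is a minimal sketch, not part of the training code: the helper name `summarize` and the `sample` string are made up for illustration, and it assumes each row has the layout shown above (GPU index, two ID columns, PID, type, process name, memory in MiB). It also groups memory by process type (C, G, C+G), which makes the difference between the two runs easier to see.

```python
import re
from collections import defaultdict

# Matches one process row as pasted above, e.g.
# "| 0 N/A N/A 1586336 C+G python 4539MiB |"
# Groups: GPU index, PID, type, process name, memory in MiB.
ROW = re.compile(
    r"\|\s*(\d+)\s+\S+\s+\S+\s+(\d+)\s+(\S+)\s+(.+?)\s+(\d+)MiB\s*\|"
)


def summarize(text, gpu=0):
    """Return (total MiB, {type: MiB}) for all process rows on the given GPU."""
    per_type = defaultdict(int)
    for line in text.splitlines():
        m = ROW.search(line)
        if m and int(m.group(1)) == gpu:
            per_type[m.group(3)] += int(m.group(5))
    return sum(per_type.values()), dict(per_type)


# Three rows copied from the "without" listing, used only as sample input.
sample = """
| 0 N/A N/A 1581927 C+G python 4539MiB |
| 0 N/A N/A 1581985 C /usr/bin/python 825MiB |
| 0 N/A N/A 1582144 G /usr/bin/python 118MiB |
"""

total, by_type = summarize(sample)
print(total, by_type)
```

Pasting each full listing into `summarize` gives one total per run, so the roughly 20G and 10G figures can be compared directly.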

This causes serious trouble if we want to increase the number of envs per GPU, because a CUDA out-of-memory error will be reported.
