Description
I noticed that after PR #1692, when training a job that is already GPU-memory intensive, GPU 0 uses much more memory than before.
For example, for Hobot agent training with 2 envs per GPU and 5 eval envs, I checked GPU 0 memory usage after training actually starts.
With `torch.cuda.synchronize()` (20 GB total):

```
| 0 N/A N/A 1586336 C+G python 4539MiB |
| 0 N/A N/A 1586384 C /usr/bin/python 825MiB |
| 0 N/A N/A 1586385 C /usr/bin/python 825MiB |
| 0 N/A N/A 1586552 C /usr/bin/python 823MiB |
| 0 N/A N/A 1586553 C+G /usr/bin/python 941MiB |
| 0 N/A N/A 1586554 C /usr/bin/python 823MiB |
| 0 N/A N/A 1586555 C /usr/bin/python 823MiB |
| 0 N/A N/A 1587277 C+G /usr/bin/python 939MiB |
| 0 N/A N/A 1587279 C /usr/bin/python 823MiB |
| 0 N/A N/A 1587558 C /usr/bin/python 823MiB |
| 0 N/A N/A 1587765 C /usr/bin/python 823MiB |
| 0 N/A N/A 1588504 C+G /usr/bin/python 2385MiB |
| 0 N/A N/A 1589074 C+G /usr/bin/python 939MiB |
| 0 N/A N/A 1589550 C+G /usr/bin/python 941MiB |
| 0 N/A N/A 1589685 C+G /usr/bin/python 941MiB |
| 0 N/A N/A 1589839 C+G /usr/bin/python 941MiB |
| 0 N/A N/A 1589973 C+G /usr/bin/python 941MiB |
```
Without `torch.cuda.synchronize()` (10 GB total):

```
| 0 N/A N/A 1581927 C+G python 4539MiB |
| 0 N/A N/A 1581985 C /usr/bin/python 825MiB |
| 0 N/A N/A 1581986 C /usr/bin/python 825MiB |
| 0 N/A N/A 1581987 C /usr/bin/python 825MiB |
| 0 N/A N/A 1582144 G /usr/bin/python 118MiB |
| 0 N/A N/A 1582679 G /usr/bin/python 118MiB |
| 0 N/A N/A 1583963 C+G /usr/bin/python 2369MiB |
| 0 N/A N/A 1584240 G /usr/bin/python 116MiB |
| 0 N/A N/A 1584399 G /usr/bin/python 118MiB |
| 0 N/A N/A 1584532 G /usr/bin/python 118MiB |
| 0 N/A N/A 1584663 G /usr/bin/python 116MiB |
| 0 N/A N/A 1584795 G /usr/bin/python 118MiB |
```
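The extra ~820 MiB shown for each additional `C` process in the first listing appears consistent with an otherwise idle CUDA context on GPU 0: a bare `torch.cuda.synchronize()` lazily initializes CUDA on the current (default) device the first time it runs in a process. A minimal sketch to observe this in isolation, assuming a fresh process on a CUDA-capable machine (illustrative only, not taken from the report above):

```python
import torch

# In a fresh process that has not touched CUDA yet, a bare synchronize()
# lazily initializes a context on the default device (GPU 0).
assert not torch.cuda.is_initialized()

torch.cuda.synchronize()             # initializes CUDA on the current device
print(torch.cuda.is_initialized())   # True; nvidia-smi now lists this PID on GPU 0
```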
This causes serious trouble when we want to increase the number of envs per GPU, because a CUDA out-of-memory error is raised.
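One possible way to avoid paying this cost in processes that never use GPU 0 would be to guard or scope the call added in PR #1692. The sketch below is only an illustration under that assumption; `safe_synchronize` is a hypothetical helper, not an existing function in the codebase:

```python
import torch

def safe_synchronize(device=None):
    """Synchronize only if this process has already initialized CUDA,
    and only on the device it actually uses, so that an env/eval worker
    does not pull an extra CUDA context onto GPU 0."""
    if torch.cuda.is_available() and torch.cuda.is_initialized():
        torch.cuda.synchronize(device)
```

With a guard like this, workers that never touch CUDA should keep the ~118 MiB graphics-only footprint seen in the second listing instead of the ~820 MiB compute footprint.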