This repository has been archived by the owner on Feb 12, 2022. It is now read-only.

Increasingly occupied GPU memory #108

Open
LeePleased opened this issue Aug 8, 2019 · 6 comments

Comments

LeePleased commented Aug 8, 2019

Hi~, I ran into a problem while running your code on GPU. During training, the program unexpectedly consumes more and more GPU memory, e.g. from 2000 MB -> 3000 MB -> ... until it finally runs out of memory. I use Python 3.6, PyTorch 0.4, and a GPU with 12 GB of memory.
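
For reference, this is the quick check I use to watch the numbers each epoch (a small logging sketch of my own, not code from the repo):

    import torch

    # Small monitoring sketch: log GPU memory once per epoch to confirm that
    # usage really grows monotonically during training.
    def log_gpu_memory(epoch):
        if torch.cuda.is_available():
            allocated = torch.cuda.memory_allocated() / 1024 ** 2  # MB held by live tensors
            cached = torch.cuda.memory_cached() / 1024 ** 2  # MB held by the caching allocator (renamed memory_reserved in newer PyTorch)
            print("epoch %d: allocated %.0f MB, cached %.0f MB" % (epoch, allocated, cached))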

@LeePleased (Author)

I guess it may be related to the RNN compact weights warning, but I don't know how to fix it.

@S-Abdelnabi

Hi,

Have you found a fix for this? I am having a similar issue. It was working on PyTorch 0.4.1: the compact-weights warning was displayed only once at the beginning, and training continued normally until the end.
However, after updating to PyTorch 1.2 I am facing the same issue as you. The warning is displayed at every forward call, and training stops with an out-of-memory error after around 100 epochs. I tried calling flatten_parameters() in the forward function of the WeightDrop class, roughly as in the sketch below, but I still get the warning.
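
A stripped-down sketch of what I tried (not the repo's exact class), assuming the wrapped cuDNN LSTM lives in self.module:

    import torch
    import torch.nn as nn

    # Stand-in for the WeightDrop wrapper (a sketch, not the repo's exact class):
    # the wrapped cuDNN RNN lives in self.module, and I call flatten_parameters()
    # on every forward to try to re-compact its weights.
    class WeightDropSketch(nn.Module):
        def __init__(self, module):
            super().__init__()
            self.module = module

        def forward(self, *args):
            # (the real WeightDrop rewrites the raw weights here via _setweights())
            if hasattr(self.module, 'flatten_parameters'):
                self.module.flatten_parameters()
            return self.module.forward(*args)

    rnn = WeightDropSketch(nn.LSTM(10, 10))
    output, hidden = rnn(torch.randn(5, 3, 10))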

Thanks a lot.


rewicks commented May 14, 2020

Also having this issue. I don't think it's related to the flatten_parameters() warnings. It seems to be correlated with the optimizer: specifically, the memory usage only starts to increase after training switches to ASGD.
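
For anyone checking where the switch happens, the trigger in main.py looks roughly like this (a paraphrased sketch, not the exact lines; the names follow the repo's argparse options):

    import torch

    def maybe_switch_to_asgd(model, optimizer, val_loss, best_val_loss, args):
        # Paraphrase of the NT-ASGD trigger in main.py: once validation loss has not
        # improved for args.nonmono epochs, SGD is replaced by ASGD, which then keeps
        # an averaged copy ('ax') of every parameter in optimizer.state.
        if ('t0' not in optimizer.param_groups[0]
                and len(best_val_loss) > args.nonmono
                and val_loss > min(best_val_loss[:-args.nonmono])):
            print('Switching to ASGD')
            optimizer = torch.optim.ASGD(model.parameters(), lr=args.lr,
                                         t0=0, lambd=0., weight_decay=args.wdecay)
        return optimizer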

@AndreaLK3

@rewicks, good call: the memory usage increases only with the ASGD optimizer. I think I have found the problem, but I am not sure how to solve it.

I printed the tensors living in memory using the GPU memory profiling code mentioned at https://discuss.pytorch.org/t/how-to-debug-causes-of-gpu-memory-leaks/6741/3, and used the PyCharm debugger to inspect the variables during training.
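
For reference, the snippet I ran is essentially the one from that thread:

    import gc
    import torch

    # List every tensor the garbage collector is currently tracking; a leak shows
    # up as a steadily growing list of (type, size) entries between epochs.
    def dump_live_tensors():
        for obj in gc.get_objects():
            try:
                if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
                    print(type(obj), obj.size())
            except Exception:
                pass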

The ASGD optimizer is an object that contains:

  • defaults: its default settings
  • param_groups: a list containing one dictionary, with the hyperparameters 'lr', 'alpha', 'lambd', 't0', 'weight_decay', and 'params' (a list of 14 Parameters holding Tensors)
  • state: a defaultdict with 20 entries, keyed by Parameters (each containing a Tensor)

As the epochs go on, optimizer.state grows to contain 20, 23, 26, 29, ... (unnamed) Tensors.
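
A quick way to watch this (my own check, not code from the repo):

    import torch

    # Print how many entries / tensors ASGD's state dict holds; in a healthy run
    # this should stay constant once ASGD has created state for every parameter.
    def report_optimizer_state(optimizer, epoch):
        n_tensors = sum(1 for entry in optimizer.state.values()
                        for v in entry.values() if torch.is_tensor(v))
        print("epoch %d: %d state entries, %d tensors held"
              % (epoch, len(optimizer.state), n_tensors))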
My hypothesis:

  • either ASGD averages over all the previous epochs, and thus eventually exhausts memory
  • or, more likely, the Tensors holding the past gradients are never de-allocated from memory, and new ones keep being allocated

Should we change the t0 parameter, increasing it by 1 each epoch? Or should we manually delete tensors from optimizer.state?
I would like to hear your opinions, and possibly from the authors as well, although maybe they did not run into this problem because they did not have resource constraints (I run out of memory on a GPU with 10 GB).


AndreaLK3 commented Jun 6, 2020

I have found a solution. If it works for others as well, this issue can be closed.

I have modified the ASGD optimizer using @mourga's port of AWD-LSTM for PyTorch 1.2.0, from: https://github.com/mourga/awd-lstm-lm

In particular, in main.py, you have to replace:

  • lines 243-245 with:

    for prm in model.parameters():
        if prm in optimizer.state.keys():
            tmp[prm] = prm.data.detach()
            prm.data = optimizer.state[prm]['ax'].detach()

  • lines 259-260 with:

    for prm in model.parameters():
        if prm in tmp.keys():
            prm.data = tmp[prm].detach()
            prm.requires_grad = True
    del tmp
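
For context, this is roughly how the surrounding evaluation block looks in my copy after the change (a paraphrased sketch; evaluate(), logging and checkpointing in between are unchanged, and exact line numbers may differ between forks):

    if 't0' in optimizer.param_groups[0]:  # we are running ASGD
        tmp = {}
        # swap in the averaged weights ('ax') for evaluation, without cloning them
        for prm in model.parameters():
            if prm in optimizer.state.keys():
                tmp[prm] = prm.data.detach()
                prm.data = optimizer.state[prm]['ax'].detach()

        val_loss2 = evaluate(val_data)
        # ... logging / checkpoint saving unchanged ...

        # restore the original (non-averaged) weights and continue training
        for prm in model.parameters():
            if prm in tmp.keys():
                prm.data = tmp[prm].detach()
                prm.requires_grad = True
        del tmp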


zhilizju commented Oct 26, 2020


Hi @AndreaLK3, it works for me as well. However, I don't achieve the perplexities that the instructions claim.

The instruction below trains a PTB model that without finetuning achieves perplexities of approximately 61.2 / 58.8 (validation / testing):

    python main.py --batch_size 20 --data data/penn --dropouti 0.4 --dropouth 0.25 --seed 141 --epoch 500 --save PTB.pt

I only achieve perplexities of 64.74 / 62.23 (validation / testing) with the same command.
My torch version is 1.5.0 and my CUDA version is 10.1.
I'd like to know your experimental results and your advice.
