[bugfix] Fix the memory bug caused by the check_code function in Trainer ignoring the pin_memory parameter #400

Open
wants to merge 1 commit into base: dev

Conversation

ouyhlan
Contributor

@ouyhlan ouyhlan commented Nov 29, 2021

Description: Fix the out-of-memory bug caused by the check_code function in Trainer ignoring the pin_memory parameter

Main reason:
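I hit an out-of-memory error while using the fastNLP library. The error occurred while training a model on the CPU. Debugging showed that in core/trainer.py, the _check_code function calls the Tester class without specifying the pin_memory parameter, and Tester initializes pin_memory to True by default.

Full call stack of the error: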

THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=2 : out of memory
Traceback (most recent call last):
  File "/data/ouyhlan/TextClassification/main.py", line 52, in <module>
    trainer = Trainer(train_data=data_bundle.get_dataset('train'), model=model, loss=loss,
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/fastNLP/core/trainer.py", line 558, in __init__
    _check_code(dataset=train_data, model=self.model, losser=losser, forward_func=self._forward_func, metrics=metrics,
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/fastNLP/core/trainer.py", line 1013, in _check_code
    evaluate_results = tester.test()
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/fastNLP/core/tester.py", line 184, in test
    for batch_x, batch_y in data_iterator:
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/fastNLP/core/batch.py", line 266, in __iter__
    for indices, batch_x, batch_y in self.dataiter:
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 477, in _next_data
    data = _utils.pin_memory.pin_memory(data)
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in pin_memory
    return [pin_memory(sample) for sample in data]
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in <listcomp>
    return [pin_memory(sample) for sample in data]
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 51, in pin_memory
    return {k: pin_memory(sample) for k, sample in data.items()}
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 51, in <dictcomp>
    return {k: pin_memory(sample) for k, sample in data.items()}
  File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 47, in pin_memory
    return data.pin_memory()
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278

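The problem disappears once pin_memory is set to False. Also, in light of pytorch/pytorch#57273, I suggest that the Trainer and Tester classes leave pin_memory disabled by default for all torch versions.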

Checklist: check whether each item below is complete

Please feel free to remove inapplicable items for your PR.

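  • The PR title starts with [$CATEGORY] (e.g. [bugfix] for bug fixes, [new] for new features, [test] for test changes, [rm] for removing old code)
  • Changes are complete (i.e. I finished coding on this PR); only open a PR once the changes are finished
  • All changes have test coverage: the modified parts pass their tests. For changes to fastnlp/fastnlp/, test code must be provided in fastnlp/test/
  • Code is well-documented: write good comments, since the API documentation is extracted from them
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change; if the change affects examples or tutorials, please contact the core developers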

Changes: describe each change item by item

  • The Tester and Trainer classes no longer enable pin_memory by default

Mention: ask someone to review your PR
@yhcc

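@ouyhlan ouyhlan changed the title from "Fix the memory bug caused by the check_code function in Trainer ignoring the pin_memory parameter" to "[bugfix] Fix the memory bug caused by the check_code function in Trainer ignoring the pin_memory parameter" Nov 29, 2021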
@yhcc
Member

yhcc commented Dec 1, 2021

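Thank you very much for submitting again. Last time I did not carefully check the details you mentioned, so I have just looked over this part of the change again. I lean toward keeping it enabled by default: when people actually train neural networks, they are mostly on GPU servers, where this extra memory use should be fairly acceptable (pin_memory does speed up data preparation, so enabling it by default should save everyone some time). Anyone who runs into memory problems can turn it off manually through pin_memory. I checked the code, and it looks like check_code in Trainer already has pin_memory off by default, doesn't it?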

_iter = DataSetIter(dataset, batch_size=batch_size, sampler=None)

@ouyhlan
Contributor Author

ouyhlan commented Dec 2, 2021

@yhcc

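Thank you very much for submitting again. Last time I did not carefully check the details you mentioned, so I have just looked over this part of the change again. I lean toward keeping it enabled by default: when people actually train neural networks, they are mostly on GPU servers, where this extra memory use should be fairly acceptable (pin_memory does speed up data preparation, so enabling it by default should save everyone some time). Anyone who runs into memory problems can turn it off manually through pin_memory. I checked the code, and it looks like check_code in Trainer already has pin_memory off by default, doesn't it?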

_iter = DataSetIter(dataset, batch_size=batch_size, sampler=None)

The problem in Trainer's check_code arises in the following call:

tester = Tester(data=dev_data[:batch_size * DEFAULT_CHECK_NUM_BATCH], model=model, metrics=metrics,
batch_size=batch_size, verbose=-1, use_tqdm=False)

Tester is used here without passing the pin_memory parameter. Now look at Tester's initialization method:
self.pin_memory = kwargs.get('pin_memory', True)

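In other words, whether or not pin_memory is passed to the trainer, pin_memory here is always enabled by default.
Also, I have reproduced this memory problem on a GPU server. The scenario was roughly that several people were running code on different cards of the same server, so host memory was not especially plentiful, and the cost of pin_memory then made a difference. If you would rather not change the default, you could also consider another fix I submitted earlier: ouyhlan@cac1331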
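To make the mechanism concrete, here is a minimal sketch using stand-in classes rather than the real fastNLP code. Only the `kwargs.get('pin_memory', True)` line is taken from the discussion; the names `check_code_buggy` and `check_code_fixed` are hypothetical, and forwarding the argument is only an assumption about what the alternative fix might look like, since the contents of that commit are not shown here.

```python
# Stand-in classes, NOT the real fastNLP implementation: they only mimic
# how the pin_memory keyword is (or is not) passed along.


class Tester:
    def __init__(self, data, **kwargs):
        self.data = data
        # Same pattern as the Tester line quoted in the discussion:
        # falls back to True whenever the caller does not pass pin_memory.
        self.pin_memory = kwargs.get('pin_memory', True)


def check_code_buggy(dev_data, pin_memory):
    # The caller's pin_memory is never forwarded, so Tester uses its default.
    return Tester(data=dev_data)


def check_code_fixed(dev_data, pin_memory):
    # Hypothetical fix: forward the caller's setting explicitly.
    return Tester(data=dev_data, pin_memory=pin_memory)


if __name__ == "__main__":
    print(check_code_buggy([1, 2, 3], pin_memory=False).pin_memory)  # True
    print(check_code_fixed([1, 2, 3], pin_memory=False).pin_memory)  # False
```

With this shape of fix, the default can stay enabled while a user who disables pin_memory on the trainer gets the same behavior in the pre-training check.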

@yhcc
Member

yhcc commented Dec 2, 2021

Yes, I think the other approach you mentioned would be better.

@yhcc
Member

yhcc commented Dec 2, 2021

Thank you very much! Could you please open a PR based on the code you mentioned?

@ouyhlan
Contributor Author

ouyhlan commented Dec 2, 2021

@yhcc

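Thank you very much! Could you please open a PR based on the code you mentioned?

I have already force-pushed (push -f) that change to this PR, so please just review the changes here.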

@ouyhlan
Contributor Author

ouyhlan commented Dec 6, 2021

@yhcc
