
Error during training: ValueError: The current device_map had weights offloaded to the disk. #153

Open
SKY-ZW opened this issue Aug 24, 2023 · 11 comments

SKY-ZW commented Aug 24, 2023

Traceback (most recent call last):
File "/content/zero_nlp/chatglm_v2_6b_lora/main.py", line 470, in
main()
File "/content/zero_nlp/chatglm_v2_6b_lora/main.py", line 133, in main
model = AutoModel.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 479, in from_pretrained
return model_class.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2881, in from_pretrained
) = cls._load_pretrained_model(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2980, in _load_pretrained_model
raise ValueError(
ValueError: The current device_map had weights offloaded to the disk. Please provide an offload_folder for them. Alternatively, make sure you have safetensors installed if the model you are using offers the weights in this format

@yuanzhoulvpi2017 (Owner) commented:

  1. Did you modify the code?
  2. Is your transformers version the latest? It is best to update it to the latest version.


SKY-ZW commented Aug 24, 2023

I did not modify the code.
Name: transformers
Version: 4.30.2

@yuanzhoulvpi2017 (Owner) commented:

“The current device_map had weights offloaded to the disk” means the device was not selected properly; the weights were probably allocated to disk (that is, the hard drive). Please check that.
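For reference, a quick sanity check (not part of the repository's code) of what hardware the Colab runtime actually exposes; if no GPU is visible, or it is too small, accelerate will place layers in CPU RAM and then spill them to disk, which is what produces the error above:

```python
# Hypothetical diagnostic snippet, not from main.py: inspect the devices
# and memory that the runtime actually provides.
import torch
import psutil  # installed as a dependency of accelerate/peft

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("GPU:", props.name)
    print("GPU memory (GB):", round(props.total_memory / 1e9, 1))
print("Host RAM (GB):", round(psutil.virtual_memory().total / 1e9, 1))
```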


SKY-ZW commented Aug 24, 2023

I'm using Colab. Do I need a local GPU?

@yuanzhoulvpi2017 (Owner) commented:

In principle it should run in Colab as well, but this error is quite strange.

My guess is that it is a version problem with the transformers, accelerate, or peft packages; try updating all of them to the latest versions.

If it still fails after that, I don't know either.


SKY-ZW commented Aug 24, 2023

I ran pip install --upgrade on all of the following packages in Colab. The versions are listed below, but I still get the same error. Do I need to pin specific versions?

Name: transformers
Version: 4.30.2
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: peft
---
Name: accelerate
Version: 0.22.0
Summary: Accelerate
Home-page: https://github.com/huggingface/accelerate
Author: The HuggingFace team
Author-email: [email protected]
License: Apache
Location: /usr/local/lib/python3.10/dist-packages
Requires: numpy, packaging, psutil, pyyaml, torch
Required-by: peft
---
Name: peft
Version: 0.5.0
Summary: Parameter-Efficient Fine-Tuning (PEFT)
Home-page: https://github.com/huggingface/peft
Author: The HuggingFace team
Author-email: [email protected]
License: Apache
Location: /usr/local/lib/python3.10/dist-packages
Requires: accelerate, numpy, packaging, psutil, pyyaml, safetensors, torch, tqdm, transformers
Required-by:

@yuanzhoulvpi2017 (Owner) commented:

I can't figure it out either.


SKY-ZW commented Aug 24, 2023

I only changed the --model_name_or_path /content/zero_nlp/chatglm_v2_6b_lora/chatglm2-6b path in train.sh; I didn't touch anything else.

@yuanzhoulvpi2017 (Owner) commented:

That doesn't affect it; I'm not sure what's going on.


SKY-ZW commented Aug 25, 2023

After I added offload_folder = "offload_folder" at line 133, the error above no longer appears, but now there's another problem: Loading checkpoint shards: 71% 5/7 [01:03<00:23, 11.53s/it] train.sh: line 24: 5399 Killed
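For context, a minimal sketch of what that change looks like; the exact arguments in the repository's main.py may differ, and device_map="auto" and trust_remote_code=True are assumptions based on the traceback and the ChatGLM2-6B model:

```python
# Hypothetical sketch of the from_pretrained call around line 133 of main.py,
# with an offload folder so weights that do not fit in memory can spill to disk.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "/content/zero_nlp/chatglm_v2_6b_lora/chatglm2-6b",  # path passed via --model_name_or_path
    trust_remote_code=True,            # ChatGLM2-6B ships its own modeling code
    device_map="auto",                 # let accelerate place layers on GPU/CPU/disk
    offload_folder="offload_folder",   # directory for weights offloaded to disk
)
```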


SKY-ZW commented Aug 25, 2023

It feels like the process used too much memory and was killed by Colab.
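If it really is host RAM running out while the checkpoint shards load, one thing that sometimes helps on Colab is loading the weights in half precision and letting transformers stream the shards instead of materializing a full copy in RAM. This is a hedged sketch, not the repository's configuration; the argument choices are illustrative:

```python
# Hypothetical memory-saving variant of the model load on Colab.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "/content/zero_nlp/chatglm_v2_6b_lora/chatglm2-6b",
    trust_remote_code=True,
    torch_dtype=torch.float16,   # load weights in fp16, roughly halving memory use
    low_cpu_mem_usage=True,      # load shard by shard instead of building a full copy first
    device_map="auto",
    offload_folder="offload_folder",
)
```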
