
Question about the data preprocessing step #5

Open
Linjz1 opened this issue Apr 22, 2021 · 5 comments

Comments

@Linjz1
Linjz1 commented Apr 22, 2021

Hi, a question about the data preprocessing step: my datasets are IMDB, Yelp-2013, and Yelp-2014, which are 10-, 5-, and 5-class classification problems respectively. Should the data be organized into the same format as the files under raw_data/sent/imdb, i.e. sentence + label? (Do multi-class problems require changing the corresponding code?) And should I then run preprocess/prep_sent.py for preprocessing?
Thanks for reading, and I look forward to your reply!
@kepei1106
Member

Hi. Organize the data into the format of the files under raw_data/sent/imdb and then run preprocess/prep_sent.py; multi-class classification does not require any code changes.

For fine-tuning you do need to change the code, because my IMDB task is binary classification. You can either directly modify my IMDB data-processing class or write your own modeled on it. See lines 143-169 of finetune/sent_data_utils_sentilr.py; the main change is the set of class labels in the get_labels function.
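
For reference, here is a minimal hedged sketch of that change, assuming a 10-class IMDB-style dataset; the class name is hypothetical and only the get_labels change mirrors what is described above:

# Hypothetical 10-class variant of the IMDB processor in
# finetune/sent_data_utils_sentilr.py (lines 143-169); the structure is
# illustrative, not the repo's exact code.
class ImdbTenClassProcessor(object):
    def get_labels(self):
        # The key change: return the full 10-label set instead of the
        # binary ["0", "1"] used for the original 2-class IMDB task.
        # The label strings must match those in your data files.
        return [str(i) for i in range(10)]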

@Linjz1
Author

Linjz1 commented Apr 23, 2021

OK, the data has been processed into the format you provided, but running prep_sent.py raises the error below. How should I resolve it?
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Traceback (most recent call last):
  File "prep_sent.py", line 210, in <module>
    convert_sentence(path, task, sentinet, gloss_embedding, gloss_embedding_norm)
  File "prep_sent.py", line 194, in convert_sentence
    clean_text_list, pos_list, senti_list, clean_label_list = process_text(text_list, label_list, sentinet, gloss_embedding, gloss_embedding_norm)
  File "prep_sent.py", line 117, in process_text
    corpus_embedding = model.encode(sent_list_str, batch_size=64)
  File "/root/anaconda3/envs/ljz_pytorch/lib/python3.6/site-packages/sentence_transformers/SentenceTransformer.py", line 194, in encode
    out_features = self.forward(features)
  File "/root/anaconda3/envs/ljz_pytorch/lib/python3.6/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/root/anaconda3/envs/ljz_pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/ljz_pytorch/lib/python3.6/site-packages/sentence_transformers/models/Transformer.py", line 38, in forward
    output_states = self.auto_model(**trans_features, return_dict=False)
  File "/root/anaconda3/envs/ljz_pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/ljz_pytorch/lib/python3.6/site-packages/transformers/models/bert/modeling_bert.py", line 969, in forward
    past_key_values_length=past_key_values_length,
  File "/root/anaconda3/envs/ljz_pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/ljz_pytorch/lib/python3.6/site-packages/transformers/models/bert/modeling_bert.py", line 207, in forward
    embeddings += position_embeddings
RuntimeError: The size of tensor a (2142) must match the size of tensor b (512) at non-singleton dimension 1
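
For context, the RuntimeError means one input was 2142 tokens long, exceeding BERT's 512 position embeddings, which matches the truncation warning at the top: the texts were never truncated. A hedged workaround sketch, pre-truncating by words before the encode call at prep_sent.py line 117 (the 300-word cap and helper name are illustrative assumptions, not part of the repo):

# Hypothetical pre-truncation before prep_sent.py line 117; the 300-word
# cap is an arbitrary bound chosen to stay below BERT's 512-subword limit.
def truncate_words(texts, max_words=300):
    return [' '.join(t.split()[:max_words]) for t in texts]

sent_list_str = truncate_words(sent_list_str)
corpus_embedding = model.encode(sent_list_str, batch_size=64)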

@kepei1106
Member

Can you run the pipeline on the raw_data I provided? On my side it runs fine with my raw_data. Your traceback looks like something went wrong inside sentence-transformers while encoding the sentences:

File "prep_sent.py", line 117, in process_text
corpus_embedding = model.encode(sent_list_str, batch_size=64)

My guess is that this is caused by a version mismatch between sentence-transformers and huggingface transformers. My preprocessing environment is:
transformers (huggingface) 2.3.0
sentence-transformers 0.2.6

I suggest first checking that your versions match these, and then debugging further based on the traceback.
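
For example, a quick hedged way to check the installed versions from inside the environment (pkg_resources ships with setuptools):

import pkg_resources

# Print the installed versions to compare against transformers 2.3.0
# and sentence-transformers 0.2.6.
for name in ('transformers', 'sentence-transformers'):
    print(name, pkg_resources.get_distribution(name).version)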

@Linjz1
Author

Linjz1 commented Apr 25, 2021

I can't run it on the raw_data you provided either; the error is the same as with my own dataset. Please see the email I sent you (at the address given in your paper).
I rebuilt the preprocessing environment you described and reinstalled everything.
Running pip install sentence-transformers==0.2.6 directly gives this error:

Using cached https://pypi.tuna.tsinghua.edu.cn/packages/51/9d/cef25b5faabdc1b54d218012ee821292312e139e76cc40105c824ad024bb/sentence-transformers-0.2.6.tar.gz (55 kB)
ERROR: Command errored out with exit status 1:
 command: /home/fzuirdata/anaconda3/envs/py37Lin/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-5xf_3m8x/sentence-transformers/setup.py'"'"'; __file__='"'"'/tmp/pip-install-5xf_3m8x/sentence-transformers/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-6e0stt18
 cwd: /tmp/pip-install-5xf_3m8x/sentence-transformers/
Complete output (5 lines):
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-install-5xf_3m8x/sentence-transformers/setup.py", line 6, in <module>
    with open('requirements.txt', mode="r", encoding="utf-8") as req_file:
FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

So I used an alternative approach and downloaded version 0.2.6 from GitHub.
The transformers (huggingface) 2.3.0 you mentioned is not compatible with sentence-transformers 0.2.6:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sentence-transformers 0.2.6 requires transformers>=2.8.0, but you have transformers 2.3.0 which is incompatible.
So I had no choice but to use the default transformers version given in the requirements.txt shipped with the 0.2.6 release on GitHub.
My environment is now sentence-transformers 0.2.6 with transformers 4.5.1, and running the script throws the error below. How can this be resolved?
Traceback (most recent call last):
  File "prep_sent.py", line 15, in <module>
    model = SentenceTransformer('sentence-transformers/bert-base-nli-mean-tokens')
  File "/home/fzuirdata/anaconda3/envs/py36torch/lib/python3.6/site-packages/sentence_transformers/SentenceTransformer.py", line 75, in __init__
    with open(os.path.join(model_path, 'modules.json')) as fIn:
FileNotFoundError: [Errno 2] No such file or directory: 'sentence-transformers/bert-base-nli-mean-tokens/modules.json'
Do you have a copy of the sentence-transformers 0.2.6 package that you could send me? My email: [email protected]

@kepei1106
Member

The requirements.txt of sentence-transformers 0.2.6 specifies transformers==2.3.0, at least in the copy I downloaded, and I hit no incompatibility errors when using it. I will send the package to your email shortly.

As for the last problem you mentioned:
FileNotFoundError: [Errno 2] No such file or directory: 'sentence-transformers/bert-base-nli-mean-tokens/modules.json'
The cause is that you haven't downloaded the sentence-transformers model bert-base-nli-mean-tokens, or the load path is set incorrectly after downloading. My code uses the path I set up on my machine; you need to change it to your own.
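
For example, a minimal hedged sketch of loading the model from a local directory; the path is illustrative and must point at the downloaded model files, including modules.json:

from sentence_transformers import SentenceTransformer

# Hypothetical local path; replace with wherever bert-base-nli-mean-tokens
# was actually downloaded.
model = SentenceTransformer('/your/local/path/bert-base-nli-mean-tokens')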
