
[Error during data preprocessing (tokenization)] datasets.builder.DatasetGenerationError #269

Open
ShanJianSoda opened this issue Nov 30, 2023 · 0 comments


I generated the jsonl file following the steps, then ran the following command:

python tokenize_dataset_rows.py ^
--jsonl_path data/alpaca_data.jsonl ^
--save_path data/alpaca ^
--max_seq_length 200 

It failed with this error:

E:\ChatGLM\ChatGLM3\ChatGLM-LoRA>python tokenize_dataset_rows.py ^
More? --jsonl_path data/alpaca_data.jsonl ^
More? --save_path data/alpaca ^
More? --max_seq_length 200
  0%|          | 0/52002 [00:00<?, ?it/s]
Generating train split: 0 examples [00:02, ? examples/s]
Traceback (most recent call last):
  File "e:\anaconda3\Lib\site-packages\datasets\builder.py", line 1676, in _prepare_split_single
    for key, record in generator:
  File "e:\anaconda3\Lib\site-packages\datasets\packaged_modules\generator\generator.py", line 30, in _generate_examples
    for idx, ex in enumerate(self.config.generator(**gen_kwargs)):
  File "E:\ChatGLM\ChatGLM3\ChatGLM-LoRA\tokenize_dataset_rows.py", line 31, in read_jsonl
    feature = preprocess(tokenizer, config, example, max_seq_length)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\ChatGLM\ChatGLM3\ChatGLM-LoRA\tokenize_dataset_rows.py", line 10, in preprocess
    prompt = example["text"]
             ~~~~~~~^^^^^^^^
KeyError: 'text'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "E:\ChatGLM\ChatGLM3\ChatGLM-LoRA\tokenize_dataset_rows.py", line 53, in <module>
    main()
  File "E:\ChatGLM\ChatGLM3\ChatGLM-LoRA\tokenize_dataset_rows.py", line 46, in main
    dataset = datasets.Dataset.from_generator(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "e:\anaconda3\Lib\site-packages\datasets\arrow_dataset.py", line 1072, in from_generator
    ).read()
      ^^^^^^
  File "e:\anaconda3\Lib\site-packages\datasets\io\generator.py", line 47, in read
    self.builder.download_and_prepare(
  File "e:\anaconda3\Lib\site-packages\datasets\builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "e:\anaconda3\Lib\site-packages\datasets\builder.py", line 1717, in _download_and_prepare
    super()._download_and_prepare(
  File "e:\anaconda3\Lib\site-packages\datasets\builder.py", line 1049, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "e:\anaconda3\Lib\site-packages\datasets\builder.py", line 1555, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "e:\anaconda3\Lib\site-packages\datasets\builder.py", line 1712, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

I searched for information but couldn't find a working fix.

Could anyone help me out?
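
In case it helps with diagnosis, here is a minimal sketch of what I plan to try, assuming the script really does expect a top-level "text" field in every record (as the example["text"] line in the traceback suggests) and that my alpaca_data.jsonl still holds raw Alpaca-style instruction/input/output records. The "text"/"target" layout and the file name alpaca_data_text.jsonl are only my own guesses, not anything from the repo:

import json

jsonl_path = "data/alpaca_data.jsonl"

# 1) Inspect the keys of the first record to see why example["text"] raises KeyError.
with open(jsonl_path, encoding="utf-8") as f:
    first = json.loads(f.readline())
print(sorted(first.keys()))

# 2) If the records are raw Alpaca fields (instruction/input/output) instead of a
#    single "text" field, rewrite them into the shape the tokenizer script reads.
#    NOTE: the "text"/"target" layout here is only a guess for illustration.
converted_path = "data/alpaca_data_text.jsonl"
with open(jsonl_path, encoding="utf-8") as fin, \
     open(converted_path, "w", encoding="utf-8") as fout:
    for line in fin:
        if not line.strip():
            continue
        r = json.loads(line)
        if "text" in r:
            row = r
        else:
            prompt = r["instruction"] + ("\n" + r["input"] if r.get("input") else "")
            row = {"text": prompt, "target": r["output"]}
        fout.write(json.dumps(row, ensure_ascii=False) + "\n")

After that I would rerun tokenize_dataset_rows.py with --jsonl_path data/alpaca_data_text.jsonl and the same remaining arguments. Does this look like the right direction?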
