
[Bug]: With checkpoint enabled, when a configured pipeline reaches its last operator and np > left samples, writing the checkpoint raises an error. #531

Open
3 tasks done
HunterLG opened this issue Jan 6, 2025 · 2 comments · May be fixed by #536
Labels: bug (Something isn't working), dj:core (issues/PRs about the core functions of Data-Juicer)

Comments

HunterLG commented Jan 6, 2025

Before Reporting 报告之前

  • I have pulled the latest code from the main branch and run it again; the bug still exists.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you ask a question using the Question template.)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

Ubuntu

Installation Method

Installed from source

Data-Juicer Version

v1.0.2

Python Version

3.10

Describe the bug

With checkpointing enabled, when a configured pipeline reaches its last operator and only 0 or 1 samples remain, writing the checkpoint raises an error.

To Reproduce

Run `python tools/process_data.py --config configs/demo/process.yaml`.
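
The full pipeline is not strictly needed to hit this. Based on the traceback below, the failure can likely be reproduced with `datasets` alone; this is a minimal sketch under that assumption (the output path is hypothetical, and newer `datasets` releases may cap the shard count instead of raising):

```python
# Minimal repro sketch, independent of Data-Juicer (an assumption drawn from
# the traceback below; behavior depends on the installed `datasets` version).
from datasets import Dataset

ds = Dataset.from_dict({"text": ["only one sample left"]})  # num_rows == 1
# np: 2 in the config becomes num_proc=2 here; save_to_disk then tries to
# write 2 shards from a 1-row dataset, and sharding goes out of range.
ds.save_to_disk("/tmp/ckpt_repro", num_proc=2)  # IndexError in affected versions
```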

Configs

```yaml
# Process config example for dataset

# global parameters
project_name: 'demo-process'
dataset_path: "/xxx/data_juicer_main/ours/our_test_dataset/original_dataset.jsonl"  # path to your dataset directory or file
np: 2                                                       # number of subprocesses to process your dataset
ds_cache_dir: '/xxx/data_juicer_main/huggingface'
export_path: './outputs/demo-process/demo-processed.jsonl'
op_fusion: false
use_checkpoint: true
#fusion_strategy: 'probe'

# process schedule: a list of several process operators with their arguments
process:
  - clean_email_mapper:
  - clean_links_mapper:
  - clean_ip_mapper:
  - remove_repeat_sentences_mapper:

  - language_id_score_filter:                               # filter text in a specific language with language scores larger than a specific min value
      lang: en                                              # keep text in what language
      min_score: 0.8
  - words_num_filter:                                       # filter text with number of words out of specific range
      lang: en                                              # sample in which language
      tokenization: false                                   # whether to use model to tokenize documents
      min_num: 10                                           # the min number of filter range
      max_num: 10000
  - maximum_line_length_filter:                             # filter text with the maximum length of lines out of specific range
      min_len: 10                                           # the min length of filter range
      max_len: 10000
  - perplexity_filter:
      lang: en
      max_ppl: 8000

  - document_simhash_deduplicator:
      tokenization: space
      window_size: 6
      lowercase: true
      ignore_pattern: '\p{P}'
      num_blocks: 6
      hamming_distance: 4
  - document_deduplicator:                                  # deduplicate text samples using md5 hashing exact matching method
      lowercase: false                                      # whether to convert text to lower case
      ignore_non_character: false
  - random_selector:                                        # selector to randomly select samples
      select_ratio: 0.5                                     # the ratio to be sampled
      select_num: 10
```

Logs

Error when 1 sample remains:

```
2024-07-08 05:07:09 | INFO | data_juicer.core.data:225 - [11/11] OP [random_selector] Done in 0.709s. Left 1 samples.
2024-07-08 05:07:09 | INFO | data_juicer.core.data:241 - Writing checkpoint of dataset processed by last op...
Saving the dataset (0/2 shards): 0%| | 0/1 [00:00<?, ? examples/s]
2024-07-08 05:07:09 | ERROR | main:19 - An error has been caught in function '', process 'MainProcess' (1378079), thread 'MainThread' (140001744877376):
Traceback (most recent call last):

File "/data/data_juicer_main/tools/process_data.py", line 19, in
main()
└ <function main at 0x7f53718f9240>

File "/data/data_juicer_main/tools/process_data.py", line 15, in main
executor.run()
│ └ <function Executor.run at 0x7f53718f9120>
└ <data_juicer.core.executor.Executor object at 0x7f5370f02680>

File "/data/data_juicer_main/data_juicer/core/executor.py", line 196, in run
dataset = dataset.process(
│ └ <function NestedDataset.process at 0x7f53718db490>
└ Dataset({
features: ['text', 'meta', 'dj__stats'],
num_rows: 11
})

File "/data/data_juicer_main/data_juicer/core/data.py", line 244, in process
checkpointer.save_ckpt(dataset)
│ │ └ Dataset({
│ │ features: ['text', 'meta', 'dj__stats', '__dj__simhash', '__dj__hash'],
│ │ num_rows: 1
│ │ })
│ └ <function CheckpointManager.save_ckpt at 0x7f53718f8940>
└ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7f5370edd510>

File "/data/data_juicer_main/data_juicer/utils/ckpt_utils.py", line 124, in save_ckpt
ds.save_to_disk(self.ckpt_ds_dir, num_proc=self.num_proc)
│ │ │ │ │ └ 2
│ │ │ │ └ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7f5370edd510>
│ │ │ └ '/data/NewDataPilotEngine/data_juicer_main/outputs/demo-process/ckpt/latest'
│ │ └ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7f5370edd510>
│ └ <function Dataset.save_to_disk at 0x7f53b5c013f0>
└ Dataset({
features: ['text', 'meta', 'dj__stats', '__dj__simhash', '__dj__hash'],
num_rows: 1
})

File "/data/xxx/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1547, in save_to_disk
for job_id, done, content in iflatmap_unordered(
└ <function iflatmap_unordered at 0x7f53b5ceee60>
File "/data/xxx/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 698, in iflatmap_unordered
async_results = [
File "/data/xxx/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 698, in
async_results = [
File "/data/xxx/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1536, in
"shard": self.shard(num_shards=num_shards, index=shard_idx, contiguous=True),
│ │ │ └ 1
│ │ └ 2
│ └ <function Dataset.shard at 0x7f53b5c14040>
└ Dataset({
features: ['text', 'meta', 'dj__stats', '__dj__simhash', '__dj__hash'],
num_rows: 1
})
File "/data/xxx/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 4716, in shard
return self.select(
│ └ <function NestedDataset.select at 0x7f53718db6d0>
└ Dataset({
features: ['text', 'meta', 'dj__stats', '__dj__simhash', '__dj__hash'],
num_rows: 1
})

File "/data/data_juicer_main/data_juicer/core/data.py", line 367, in select
return nested_obj_factory(super().select(*args, **kargs))
│ │ └ {'indices': range(1, 1), 'keep_in_memory': False, 'indices_cache_file_name': None, 'writer_batch_size': 1000}
│ └ ()
└ <function nested_obj_factory at 0x7f53718daf80>

File "/data/xxx/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 560, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'indices': range(1, 1), 'keep_in_memory': False, 'indices_cache_file_name': None, 'writer_batch_size': 1000}
│ │ │ └ ()
│ │ └ Dataset({
│ │ features: ['text', 'meta', 'dj__stats', '__dj__simhash', '__dj__hash'],
│ │ num_rows: 1
│ │ })
│ └ <function Dataset.select at 0x7f53b5c035b0>
└ typing.Union
File "/data/xxx/lib/python3.10/site-packages/datasets/fingerprint.py", line 442, in wrapper
out = func(dataset, *args, **kwargs)
│ │ │ └ {'indices': range(1, 1), 'keep_in_memory': False, 'indices_cache_file_name': None, 'writer_batch_size': 1000, 'new_fingerprin...
│ │ └ ()
│ └ Dataset({
│ features: ['text', 'meta', 'dj__stats', '__dj__simhash', '__dj__hash'],
│ num_rows: 1
│ })
└ <function Dataset.select at 0x7f53b5c03520>
File "/data/xxx/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3866, in select
return self._select_contiguous(start, length, new_fingerprint=new_fingerprint)
│ │ │ │ └ '988d5b9afaef6a64'
│ │ │ └ 0
│ │ └ 1
│ └ <function Dataset._select_contiguous at 0x7f53b5c03640>
└ Dataset({
features: ['text', 'meta', 'dj__stats', '__dj__simhash', '__dj__hash'],
num_rows: 1
})
File "/data/xxx/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 560, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'new_fingerprint': '988d5b9afaef6a64'}
│ │ │ └ (1, 0)
│ │ └ Dataset({
│ │ features: ['text', 'meta', 'dj__stats', '__dj__simhash', '__dj__hash'],
│ │ num_rows: 1
│ │ })
│ └ <function Dataset._select_contiguous at 0x7f53b5c03760>
└ typing.Union
File "/data/xxx/lib/python3.10/site-packages/datasets/fingerprint.py", line 442, in wrapper
out = func(dataset, *args, **kwargs)
│ │ │ └ {'new_fingerprint': '988d5b9afaef6a64'}
│ │ └ (1, 0)
│ └ Dataset({
│ features: ['text', 'meta', 'dj__stats', '__dj__simhash', '__dj__hash'],
│ num_rows: 1
│ })
└ <function Dataset._select_contiguous at 0x7f53b5c036d0>
File "/data/xxx/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3926, in _select_contiguous
_check_valid_indices_value(start, len(self))
│ │ └ Dataset({
│ │ features: ['text', 'meta', 'dj__stats', '__dj__simhash', '__dj__hash'],
│ │ num_rows: 1
│ │ })
│ └ 1
└ <function _check_valid_indices_value at 0x7f53b5c00820>
File "/data/xxx/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 622, in _check_valid_indices_value
raise IndexError(f"Index {index} out of range for dataset of size {size}.")

IndexError: Index 1 out of range for dataset of size 1.
```
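
Reading this traceback: with `np: 2`, `save_to_disk` splits the 1-row dataset into 2 contiguous shards, and shard index 1 resolves to `start=1, length=0` (visible as `(1, 0)` in the frames above), so `_check_valid_indices_value(1, 1)` raises. A rough sketch of the contiguous shard arithmetic (a hypothetical helper mirroring what `datasets.Dataset.shard` computes, not its actual code):

```python
def contiguous_shard_bounds(num_rows: int, num_shards: int, index: int):
    # Split rows as evenly as possible into num_shards contiguous slices.
    div, mod = divmod(num_rows, num_shards)
    start = div * index + min(index, mod)
    length = div + (1 if index < mod else 0)
    return start, length

print(contiguous_shard_bounds(1, 2, 1))  # (1, 0) -> start 1 is out of range for size 1
```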

Error when 0 samples remain:

```
2024-07-08 00:02:32 | INFO | data_juicer.core.data:225 - [10/11] OP [document_deduplicator] Done in 0.714s. Left 0 samples.
2024-07-08 00:02:33 | INFO | data_juicer.core.data:225 - [11/11] OP [random_selector] Done in 0.799s. Left 0 samples.
2024-07-08 00:02:33 | INFO | data_juicer.core.data:241 - Writing checkpoint of dataset processed by last op...
Saving the dataset (0/2 shards): 0 examples [00:00, ? examples/s]Process ForkPoolWorker-25:
Traceback (most recent call last):
File "/data/xxx/lib/python3.10/site-packages/multiprocess/process.py", line 315, in _bootstrap
self.run()
File "/data/xxx/lib/python3.10/site-packages/multiprocess/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/data/datapilotengine/lib/python3.10/site-packages/multiprocess/pool.py", line 114, in worker
task = get()
File "/data/xxx/lib/python3.10/site-packages/multiprocess/queues.py", line 371, in get
return _ForkingPickler.loads(res)
File "/data/xxx/lib/python3.10/site-packages/dill/_dill.py", line 327, in loads
return load(file, ignore, **kwds)
File "/data/xxx/lib/python3.10/site-packages/dill/_dill.py", line 313, in load
return Unpickler(file, ignore=ignore, **kwds).load()
File "/data/xxx/lib/python3.10/site-packages/dill/_dill.py", line 525, in load
obj = StockUnpickler.load(self)
File "/data/xxx/lib/python3.10/site-packages/datasets/table.py", line 1318, in setstate
table = self._concat_blocks_horizontally_and_vertically(blocks)
Process ForkPoolWorker-26:
File "/data/xxx/lib/python3.10/site-packages/datasets/table.py", line 1351, in _concat_blocks_horizontally_and_vertically
return cls._concat_blocks(pa_tables_to_concat_vertically, axis=0)
File "/data/xxx/lib/python3.10/site-packages/datasets/table.py", line 1331, in _concat_blocks
return pa.concat_tables(pa_tables, promote_options="default")
File "pyarrow/table.pxi", line 6256, in pyarrow.lib.concat_tables
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Must pass at least one table
Traceback (most recent call last):
File "/data/xxx/lib/python3.10/site-packages/multiprocess/process.py", line 315, in _bootstrap
self.run()
File "/data/xxx/lib/python3.10/site-packages/multiprocess/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/data/xxx/lib/python3.10/site-packages/multiprocess/pool.py", line 114, in worker
task = get()
File "/data/xxx/lib/python3.10/site-packages/multiprocess/queues.py", line 371, in get
return _ForkingPickler.loads(res)
File "/data/xxx/lib/python3.10/site-packages/dill/_dill.py", line 327, in loads
return load(file, ignore, **kwds)
File "/data/xxx/lib/python3.10/site-packages/dill/_dill.py", line 313, in load
return Unpickler(file, ignore=ignore, **kwds).load()
File "/data/xxx/lib/python3.10/site-packages/dill/_dill.py", line 525, in load
obj = StockUnpickler.load(self)
File "/data/xxx/lib/python3.10/site-packages/datasets/table.py", line 1318, in setstate
table = self._concat_blocks_horizontally_and_vertically(blocks)
File "/data/xxx/lib/python3.10/site-packages/datasets/table.py", line 1351, in _concat_blocks_horizontally_and_vertically
return cls._concat_blocks(pa_tables_to_concat_vertically, axis=0)
File "/data/xxx/lib/python3.10/site-packages/datasets/table.py", line 1331, in _concat_blocks
return pa.concat_tables(pa_tables, promote_options="default")
File "pyarrow/table.pxi", line 6256, in pyarrow.lib.concat_tables
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Must pass at least one table
Saving the dataset (0/2 shards): 0 examples [00:00, ? examples/s]
2024-07-08 00:02:33 | ERROR | main:19 - An error has been caught in function '', process 'MainProcess' (1219391), thread 'MainThread' (139747673184064):
Traceback (most recent call last):

File "/data/data_juicer_main/tools/process_data.py", line 19, in
main()
└ <function main at 0x7f1845cd9240>

File "/data/data_juicer_main/tools/process_data.py", line 15, in main
executor.run()
│ └ <function Executor.run at 0x7f1845cd9120>
└ <data_juicer.core.executor.Executor object at 0x7f18452126b0>

File "/data/data_juicer_main/data_juicer/core/executor.py", line 196, in run
dataset = dataset.process(
│ └ <function NestedDataset.process at 0x7f1845cbb490>
└ Dataset({
features: ['text', 'meta'],
num_rows: 5
})

File "/data/data_juicer_main/data_juicer/core/data.py", line 244, in process
checkpointer.save_ckpt(dataset)
│ │ └ Dataset({
│ │ features: ['text', 'meta', 'dj__stats'],
│ │ num_rows: 0
│ │ })
│ └ <function CheckpointManager.save_ckpt at 0x7f1845cd8940>
└ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7f18452d15d0>

File "/data/data_juicer_main/data_juicer/utils/ckpt_utils.py", line 124, in save_ckpt
ds.save_to_disk(self.ckpt_ds_dir, num_proc=self.num_proc)
│ │ │ │ │ └ 2
│ │ │ │ └ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7f18452d15d0>
│ │ │ └ '/data/NewDataPilotEngine/data_juicer_main/outputs/demo-process/ckpt/latest'
│ │ └ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7f18452d15d0>
│ └ <function Dataset.save_to_disk at 0x7f188de9d3f0>
└ Dataset({
features: ['text', 'meta', 'dj__stats'],
num_rows: 0
})

File "/data/xxx/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1547, in save_to_disk
for job_id, done, content in iflatmap_unordered(
└ <function iflatmap_unordered at 0x7f188df8ae60>
File "/data/xxx/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 711, in iflatmap_unordered
raise RuntimeError(

RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.
```

Screenshots

No response

Additional

No response

@HunterLG HunterLG added the bug Something isn't working label Jan 6, 2025
HunterLG (Author) commented Jan 7, 2025

With checkpointing enabled, saving the checkpoint file requires allocating resources sensibly.
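
One plausible mitigation, sketched below, is to clamp the number of writer processes to the dataset size in `CheckpointManager.save_ckpt` (the names `self.num_proc` and `self.ckpt_ds_dir` come from the traceback above; whether the linked PR actually takes this approach is not confirmed here):

```python
# Hypothetical patch sketch for data_juicer/utils/ckpt_utils.py; not the
# actual fix from PR #536. save_to_disk() writes one shard per process,
# so never request more processes than there are rows (and at least 1).
def save_ckpt(self, ds):
    num_proc = min(self.num_proc, max(len(ds), 1))
    ds.save_to_disk(self.ckpt_ds_dir, num_proc=num_proc)
```

The zero-row case may still need separate handling (e.g. skipping the checkpoint write entirely), since an empty dataset is what triggers the `ArrowInvalid: Must pass at least one table` error in the second log above.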

@HunterLG HunterLG closed this as not planned Jan 7, 2025
@github-project-automation github-project-automation bot moved this from Todo to Done in data-juicer Jan 7, 2025
@HunterLG HunterLG reopened this Jan 8, 2025
@github-project-automation github-project-automation bot moved this from Done to In Progress in data-juicer Jan 8, 2025
@HunterLG HunterLG changed the title [Bug]: With checkpoint enabled, when a configured pipeline reaches its last operator and 0 or 1 samples remain, writing the checkpoint raises an error. [Bug]: With checkpoint enabled, when a configured pipeline reaches its last operator and np >= left samples, writing the checkpoint raises an error. Jan 8, 2025
@HunterLG HunterLG changed the title [Bug]: With checkpoint enabled, when a configured pipeline reaches its last operator and np >= left samples, writing the checkpoint raises an error. [Bug]: With checkpoint enabled, when a configured pipeline reaches its last operator and np > left samples, writing the checkpoint raises an error. Jan 8, 2025
HYLcool (Collaborator) commented Jan 9, 2025

@HunterLG, thanks for your interest in and use of Data-Juicer!

We have reproduced the case you described; the problem does indeed exist, and we will fix it as soon as possible~

@HYLcool HYLcool self-assigned this Jan 9, 2025
@yxdyc yxdyc added the dj:core issues/PRs about the core functions of Data-Juicer label Jan 9, 2025
@HYLcool HYLcool linked a pull request Jan 9, 2025 that will close this issue