datasets.Dataset.map too slow even with num_proc

After modifying my dataset processing pipeline, the speed of datasets.Dataset.map() remains slow (40-70 examples/sec). The dataset mapping operations are as follows:


padded_dataset = dataset.map(pad_sequence, batched=True, num_proc=data_args.preprocessing_num_workers)
sp_dataset = padded_dataset.map(sp_split, batched=True, num_proc=data_args.preprocessing_num_workers)
However, throughput is still too low even after tuning num_proc; I expected it to be much higher.
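One way to narrow this down might be to time the mapped function by itself, in a single process, on one in-memory batch shaped like what `map(..., batched=True)` passes in (a dict of column name to list of values). If the function alone only reaches 40-70 examples/sec, adding workers is unlikely to help much. This is only a sketch: the `pad_sequence` body below is a hypothetical stand-in (the real one is not shown here), and `input_ids`, `max_len`, `pad_id` and `measure_throughput` are assumed names.

```python
import time


def pad_sequence(batch, max_len=16, pad_id=0):
    # Hypothetical stand-in: the real pad_sequence from this issue is not shown.
    # Right-pads (or truncates) every sequence to exactly max_len tokens.
    padded = [
        (ids + [pad_id] * (max_len - len(ids)))[:max_len]
        for ids in batch["input_ids"]
    ]
    return {"input_ids": padded}


def measure_throughput(fn, batch, repeats=5):
    """Return examples/sec for fn on one batch, run in the current process."""
    n_examples = len(next(iter(batch.values())))
    start = time.perf_counter()
    for _ in range(repeats):
        fn(batch)
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very fast functions.
    return (n_examples * repeats) / max(elapsed, 1e-9)


# Synthetic batch in the dict-of-lists layout used with batched=True.
batch = {"input_ids": [[1, 2, 3] * (i % 5 + 1) for i in range(1000)]}
rate = measure_throughput(pad_sequence, batch)
print(f"{rate:.0f} examples/sec in a single process")
```

If the single-process rate is already far above what `map` reports, the time is probably going somewhere other than the function itself (for example writing results or inter-process overhead), and `batch_size` may be worth tuning alongside `num_proc`.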

[rank1]: RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.
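The error message suggests disabling multiprocessing. A minimal sketch of that idea, independent of the library: run the batched function over slices in the main process, so that the original exception surfaces together with the index of the failing batch instead of the generic "subprocess died" error. `find_failing_batch` is a hypothetical helper, and the `sp_split` body is a toy stand-in, since the real one is not shown here.

```python
def sp_split(batch, sp_size=2):
    # Hypothetical stand-in for sp_split (real body not shown in this issue):
    # splits each sequence into sp_size equal chunks and fails if it cannot.
    chunks = []
    for ids in batch["input_ids"]:
        if not ids or len(ids) % sp_size != 0:
            raise ValueError(f"length {len(ids)} not divisible by {sp_size}")
        step = len(ids) // sp_size
        chunks.append([ids[i:i + step] for i in range(0, len(ids), step)])
    return {"chunks": chunks}


def find_failing_batch(fn, columns, batch_size=1000):
    """Run fn batch by batch in this process.

    Returns (start_index, exception) for the first failing batch, or None.
    """
    n_rows = len(next(iter(columns.values())))
    for start in range(0, n_rows, batch_size):
        batch = {
            name: values[start:start + batch_size]
            for name, values in columns.items()
        }
        try:
            fn(batch)
        except Exception as exc:
            return start, exc
    return None


# Toy data: one row (index 5) has an odd length and cannot be split in two.
columns = {"input_ids": [[0] * 8] * 5 + [[0] * 7] + [[0] * 8] * 4}
failure = find_failing_batch(sp_split, columns, batch_size=2)
print(failure)
```

With the real dataset, the equivalent check would be to call `map` with `num_proc` left unset on a small slice such as `dataset.select(range(1000))`. If nothing raises in a single process, the worker may have been killed by the operating system (for example for running out of memory), which this kind of check cannot detect.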

datasets.Dataset.map too slow even with num_proc #10
