Open
Description
After modifying my dataset processing pipeline, datasets.Dataset.map() is still slow (40-70 examples/sec). The two mapping operations are:
padded_dataset = dataset.map(pad_sequence, batched=True, num_proc=data_args.preprocessing_num_workers)
sp_dataset = padded_dataset.map(sp_split, batched=True, num_proc=data_args.preprocessing_num_workers)
Tuning num_proc did not help: throughput stays in the same range, far below what I expected.
The run also fails with this error:
[rank1]: RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.
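The error message suggests disabling multiprocessing. One way to follow that advice, sketched here under assumptions, is to call the batched function directly in the main process on a small batch. Any exception then surfaces with a full traceback, and the function's raw throughput can be measured without worker overhead. `time_batched_fn`, `pad_sequence_stub`, and `toy_batch` are hypothetical names for illustration. The stub only stands in for the real `pad_sequence`, whose implementation is not shown in this issue.

```python
import time


def time_batched_fn(fn, batch, repeats=3):
    """Call a batched map function in-process and return examples/sec.

    `batch` is a dict of equal-length lists, the shape that
    Dataset.map(..., batched=True) passes to the mapped function.
    """
    n_examples = len(next(iter(batch.values())))
    start = time.perf_counter()
    for _ in range(repeats):
        fn(batch)  # any exception is raised here, in the main process
    elapsed = time.perf_counter() - start
    return n_examples * repeats / max(elapsed, 1e-9)


def pad_sequence_stub(batch, max_len=8, pad_id=0):
    """Hypothetical stand-in for pad_sequence: truncate or right-pad to max_len."""
    padded = [
        ids[:max_len] + [pad_id] * (max_len - len(ids))
        for ids in batch["input_ids"]
    ]
    return {"input_ids": padded}


# Toy batch, not real data from the issue.
toy_batch = {"input_ids": [[1, 2, 3], [4, 5], [6]]}
rate = time_batched_fn(pad_sequence_stub, toy_batch)
print(f"{rate:.0f} examples/sec (function only, no multiprocessing)")
```

If the function is fast when called this way, the slowdown probably lies outside it. The same check can be run through the library on a small subset, for example `dataset.select(range(1000)).map(pad_sequence, batched=True, num_proc=None)`, which keeps the work in a single process as the error message recommends.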