🐛 Describe the bug
When using `ShardingFilterIterDataPipe`, the data in the datapipe is evenly sharded across `num_of_instances` workers. However, if `batch()` is called later on the datapipe, this overly even distribution can cause workers to discard data that would not need to be discarded otherwise.

This might not be considered a bug, but it is unexpected. Moreover, the current `ShardingFilterIterDataPipe` produces different batches for different numbers of workers, which is also unexpected.
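The effect can be reproduced without torchdata at all. The sketch below is a minimal plain-Python simulation of the behavior described above; `shard` mimics the round-robin element sharding of `ShardingFilterIterDataPipe`, and `batch` mimics `batch()` with `drop_last=True`. The function names are illustrative, not torchdata API.

```python
def shard(data, num_instances, instance_id):
    # Round-robin sharding: each worker keeps every num_instances-th element,
    # analogous to what ShardingFilterIterDataPipe does.
    return data[instance_id::num_instances]

def batch(data, batch_size, drop_last=True):
    # Group into fixed-size batches, dropping an incomplete trailing batch.
    batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches

data = list(range(10))

# Sharding first, then batching with batch size 2 across 3 workers:
# worker 0 gets [0, 3, 6, 9] -> [[0, 3], [6, 9]]   (nothing dropped)
# worker 1 gets [1, 4, 7]    -> [[1, 4]]           (7 dropped)
# worker 2 gets [2, 5, 8]    -> [[2, 5]]           (8 dropped)
for worker in range(3):
    print(worker, batch(shard(data, 3, worker), 2))

# Without sharding, batching the same 10 items drops nothing:
print(batch(data, 2))  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```

Two items are discarded that a single-worker run would keep, and the batch contents themselves (e.g. `[0, 3]` vs `[0, 1]`) depend on the worker count.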
One solution is a sharding filter that is aware of the batch size of the datapipe, exposing something like a `set_batch_size()` method that needs to be called once the batch size is determined.

I wonder what the torchdata team thinks of the current sharding filter. Is its behavior expected?
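One way such a batch-aware filter could work is to shard whole batch-sized chunks round-robin instead of individual elements, so that at most one chunk globally is incomplete. The class below is a hypothetical sketch of that idea in plain Python (the class name, `apply_sharding`, and the internals are assumptions, not an existing torchdata API); only `set_batch_size()` comes from the proposal above.

```python
class BatchAwareShardingFilter:
    """Hypothetical sketch: distribute contiguous batch-sized chunks
    round-robin across workers, so no worker receives a partial batch
    except the single owner of the global tail."""

    def __init__(self, source):
        self.source = source
        self.batch_size = None
        self.num_instances = 1
        self.instance_id = 0

    def set_batch_size(self, batch_size):
        # Must be called once the batch size is determined.
        self.batch_size = batch_size

    def apply_sharding(self, num_instances, instance_id):
        self.num_instances = num_instances
        self.instance_id = instance_id

    def __iter__(self):
        assert self.batch_size is not None, "set_batch_size() was not called"
        buf, chunk_idx = [], 0
        for item in self.source:
            buf.append(item)
            if len(buf) == self.batch_size:
                # Emit the full chunk only on its round-robin owner.
                if chunk_idx % self.num_instances == self.instance_id:
                    yield from buf
                buf, chunk_idx = [], chunk_idx + 1
        # The incomplete tail chunk also goes to its round-robin owner.
        if buf and chunk_idx % self.num_instances == self.instance_id:
            yield from buf
```

With 10 items, batch size 2, and 3 workers, the chunks `[0,1] [2,3] [4,5] [6,7] [8,9]` go to workers 0, 1, 2, 0, 1 respectively, so a subsequent `batch(2, drop_last=True)` discards nothing, at the cost of a less even item count per worker.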
Versions
torch 2.0.0
torchaudio 2.0.0
torchdata 0.6.0