Please support prefetch with python datasets #5323
Thank you @tomvdw, I will check that out! Just eyeballing it, one way to make grain more accessible would be to add usage examples to the README so folks who look at the repo can get the gestalt. I tried clicking the docs link in the GitHub iOS app and it took me to a folder of code, so I'll take a look in there. I meekly suggest that code examples above the fold are great advertising for any repo. Thank you for sharing.
Is your feature request related to a problem? Please describe.
There's a tremendous performance difference between datasets that are tensor-based end-to-end and datasets where some data wrangling happens in Python.
I was hoping to use "prefetch" to prepare data on the CPU while the GPU does work, but unfortunately this only helps if the data preparation is expressed entirely in TensorFlow ops (I'm not sure of the right term here).
Python I/O and preprocessing are often orders of magnitude slower; they become a bottleneck that keeps accelerators from running at capacity.
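For context, the closest plain-Python workaround is a background thread that fills a bounded buffer ahead of the consumer. Here is a minimal sketch (the `ThreadPrefetcher` name and buffer size are invented for illustration); because of the GIL, this only overlaps I/O-bound work with the consumer, which is exactly the limitation described above:

```python
import queue
import threading


class ThreadPrefetcher:
    """Prefetch items from a Python generator on a background thread.

    Hypothetical helper for illustration: the GIL means this overlaps
    only I/O-bound producer work (file reads, network requests) with
    the consumer; CPU-bound Python prep still contends for the lock.
    """

    _SENTINEL = object()  # marks end-of-stream in the queue

    def __init__(self, gen, buffer_size=4):
        self._q = queue.Queue(maxsize=buffer_size)
        self._thread = threading.Thread(
            target=self._fill, args=(gen,), daemon=True
        )
        self._thread.start()

    def _fill(self, gen):
        for item in gen:
            self._q.put(item)  # blocks once the buffer is full
        self._q.put(self._SENTINEL)

    def __iter__(self):
        while True:
            item = self._q.get()
            if item is self._SENTINEL:
                return
            yield item


# Usage: the producer runs ahead of the consumer, order is preserved.
items = list(ThreadPrefetcher((i * i for i in range(5))))
```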
Describe the solution you'd like
I wish tf.data.Dataset.prefetch were more broadly compatible with data prepared by plain, non-TensorFlow Python code.
Would it be possible for prefetch to use some performant C++ to sidestep Python GIL issues, juggling Python data-wrangling CPU processes alongside GPU training / inference without that CPU work happening in the main Python driver process? I just want to be able to prefetch custom Python datasets. There's often some prep involved; not every dataset is TensorFlow end-to-end.
i.e. instead of (python)->(gpu) what if it were
(python)->(cpp)
(cpp)->(python_prefetch)
(cpp)->(gpu/tpu accelerator)
Since C++ lacks a GIL, it could run the Python generator in a process isolated from the main Python driver process's GIL. You'd still have a GIL per generator process, but that's an easy fix: just run more Python processes with different random seeds, etc.
Describe alternatives you've considered
Torch's DataLoader could be an option, but it also seems to be a Python-driven solution and therefore not especially performant. I tried threading and multiprocessing, but the GIL limits the former, and pickling/unpickling overhead in the latter can be pretty bad. I think C++ could run background Python processes to prefetch Python data more efficiently than Python itself could.
Additional context
Broader accessibility of custom dataset prefetching could enable new use-cases for tf.data especially in prototyping or infinite search spaces where it might not make sense to convert entire datasets to tensors in advance.
Apologies if I misunderstand the intricacies involved; I just want to prefetch datasets built from generators. I tried doing this last week and it didn't work, so hopefully I'm not raising an issue that has already been fixed, or missing an existing way to pull it off. It's hard to provide an example since the code in question is closed-source and quite extensive anyway. For a good example of when this might be handy, consider RL gym environments or datasets that involve making GET requests.
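For the GET-request case specifically, a thread pool already overlaps network latency, since sockets release the GIL while waiting. A toy sketch with a fake `requests.get` stand-in (`fake_get` and the URLs are invented for illustration; `pool.map` preserves input order):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_get(url):
    # Stand-in for requests.get: sleep simulates network latency,
    # during which real sockets would release the GIL.
    time.sleep(0.05)
    return f"payload:{url}"


urls = [f"https://example.com/item/{i}" for i in range(8)]

# Four workers overlap the waiting, so total wall time is roughly
# ceil(8 / 4) * 0.05s instead of 8 * 0.05s sequentially.
with ThreadPoolExecutor(max_workers=4) as pool:
    payloads = list(pool.map(fake_get, urls))
```

CPU-bound wrangling after the fetch, however, runs back under one GIL, which is where the process-based prefetching requested above would still matter.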