Please support prefetch with python datasets #5323
Thank you @tomvdw, I will check that out! Just eyeballing it, one way to make grain more accessible would be to add usage examples to the README so folks who look at the repo can get the gestalt. I tried clicking the docs link in the GitHub iOS app and it took me to a folder of code, so I'll take a look in there. I meekly suggest that code examples above the fold are great advertising for any repo. Thank you for sharing.
Is your feature request related to a problem? Please describe.
There's a tremendous performance difference between datasets that are tensor-based end-to-end and datasets where some data wrangling happens in Python.
I was hoping to use "prefetch" to prepare data on the CPU while the GPU does work, but unfortunately this only helps if the data preparation is expressed entirely in TensorFlow ops (I'm not sure of the right term here).
Python I/O and preprocessing are often orders of magnitude slower; they become a bottleneck that keeps accelerators from running at capacity.
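For context, the closest plain-Python workaround is a background thread that fills a bounded buffer ahead of the consumer. Here is a minimal sketch (the `ThreadPrefetcher` name and buffer size are invented for illustration); because of the GIL, this only overlaps I/O-bound work with the consumer, which is exactly the limitation described above:

```python
import queue
import threading


class ThreadPrefetcher:
    """Prefetch items from a Python generator on a background thread.

    Hypothetical helper for illustration: the GIL means this overlaps
    only I/O-bound producer work (file reads, network requests) with
    the consumer; CPU-bound Python prep still contends for the lock.
    """

    _SENTINEL = object()  # marks end-of-stream in the queue

    def __init__(self, gen, buffer_size=4):
        self._q = queue.Queue(maxsize=buffer_size)
        self._thread = threading.Thread(
            target=self._fill, args=(gen,), daemon=True
        )
        self._thread.start()

    def _fill(self, gen):
        for item in gen:
            self._q.put(item)  # blocks once the buffer is full
        self._q.put(self._SENTINEL)

    def __iter__(self):
        while True:
            item = self._q.get()
            if item is self._SENTINEL:
                return
            yield item


# Usage: the producer runs ahead of the consumer, order is preserved.
items = list(ThreadPrefetcher((i * i for i in range(5))))
```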
Describe the solution you'd like
I wish tf.data.Dataset.prefetch were more broadly compatible with data prepared by plain, non-TensorFlow Python code.
Would it be possible for prefetch to use some performant C++ to sidestep Python GIL issues, juggling Python data-wrangling CPU processes alongside GPU training / inference without that CPU work happening in the main Python driver process? I just want to be able to prefetch custom Python datasets. There's often some prep involved; not every dataset is TensorFlow end-to-end.
i.e. instead of (python)->(gpu) what if it were
(python)->(cpp)
(cpp)->(python_prefetch)
(cpp)->(gpu/tpu accelerator)
Since C++ lacks a GIL, it could run the Python generator in a process isolated from the main Python driver process's GIL. You'd still have a GIL per generator process, but that's an easy fix: just run more Python processes with different random seeds, etc.
Describe alternatives you've considered
Torch's DataLoader could be an option, but it also seems to be a Python-driven solution and therefore not especially performant. I tried threading and multiprocessing, but the GIL limits the former, and pickling/unpickling overhead in the latter can be pretty bad. I think C++ could run background Python processes to prefetch Python data more efficiently than Python itself could.
Additional context
Broader accessibility of custom dataset prefetching could enable new use-cases for tf.data especially in prototyping or infinite search spaces where it might not make sense to convert entire datasets to tensors in advance.
Apologies if I misunderstand the intricacies involved; I just want to prefetch datasets built from generators. I tried doing this last week and it didn't work, so hopefully I'm not raising an issue that has already been fixed, or missing an existing way to pull it off. It's hard to provide an example since the code in question is closed-source and quite extensive anyway. For a good example of when this might be handy, consider RL gym environments or datasets that involve making GET requests.
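For the GET-request case specifically, a thread pool already overlaps network latency, since sockets release the GIL while waiting. A toy sketch with a fake `requests.get` stand-in (`fake_get` and the URLs are invented for illustration; `pool.map` preserves input order):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_get(url):
    # Stand-in for requests.get: sleep simulates network latency,
    # during which real sockets would release the GIL.
    time.sleep(0.05)
    return f"payload:{url}"


urls = [f"https://example.com/item/{i}" for i in range(8)]

# Four workers overlap the waiting, so total wall time is roughly
# ceil(8 / 4) * 0.05s instead of 8 * 0.05s sequentially.
with ThreadPoolExecutor(max_workers=4) as pool:
    payloads = list(pool.map(fake_get, urls))
```

CPU-bound wrangling after the fetch, however, runs back under one GIL, which is where the process-based prefetching requested above would still matter.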