-
Notifications
You must be signed in to change notification settings - Fork 284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wrapper for Arrow Datasets & Dataset Pieces #754
base: master
Are you sure you want to change the base?
Conversation
Add a wrapper around Arrow's ParquetDataset legacy class, to allow us to re-implement that class's API using Arrow's new dataset class. Add a wrapper around Arrow's ParquetDatasetPiece legacy class, to allow us to re-implement that class's API using Arrow's new dataset "Fragment" class.
Dan Lidral-Porter seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
Codecov Report
@@ Coverage Diff @@
## master #754 +/- ##
==========================================
+ Coverage 86.27% 86.38% +0.11%
==========================================
Files 85 86 +1
Lines 5084 5141 +57
Branches 787 793 +6
==========================================
+ Hits 4386 4441 +55
Misses 559 559
- Partials 139 141 +2
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Do you think we can keep it in a PR for now until we have a followup that builds on this PR with the actual new version of a dataset being used?
from pyarrow import parquet as pq | ||
|
||
|
||
class PetastormPyArrowDataset: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment explaining why do we need the wrapper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good feature
This is a wrapper around PyArrow's ParquetDataset and ParquetDatasetPieces, the first part of the effort to support PyArrow's new Dataset API that was discussed in issue #613.
Since this wrapper has no functionality of its own, I didn't add unit tests specifically for the wrapper; let me know if I should.
I was unable to get the tests in
petastorm/tests/test_tf_dataset.py
to complete when running them inside the provided docker container, even when I left them to run for over 24 hours. Given that this is only a wrapper, and the code paths where the wrapper is used are covered by other tests, I think it's unlikely that tests in that file would fail when all other tests pass, but I'll keep an eye on the CI results. If they fail, then don't bother reviewing this until I fix them up (which may take several rounds of pushing new commits, as I can only run those tests in CI and not locally).