Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrapper for Arrow Datasets & Dataset Pieces #754

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

aperiodic
Copy link

This is a wrapper around PyArrow's ParquetDataset and ParquetDatasetPieces, the first part of the effort to support PyArrow's new Dataset API that was discussed in issue #613.

Since this wrapper has no functionality of its own, I didn't add unit tests specifically for the wrapper; let me know if I should.

I was unable to get the tests in petastorm/tests/test_tf_dataset.py to complete when running them inside the provided docker container, even when I left them to run for over 24 hours. Given that this is only a wrapper, and the code paths where the wrapper is used are covered by other tests, I think it's unlikely that tests in that file would fail when all other tests pass, but I'll keep an eye on the CI results. If they fail, then don't bother reviewing this until I fix them up (which may take several rounds of pushing new commits, as I can only run those tests in CI and not locally).

Add a wrapper around Arrow's ParquetDataset legacy class, to allow us to
re-implement that class's API using Arrow's new dataset class.

Add a wrapper around Arrow's ParquetDatasetPiece legacy class, to allow
us to re-implement that class's API using Arrow's new dataset "Fragment"
class.
@CLAassistant
Copy link

CLAassistant commented Apr 28, 2022

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Dan Lidral-Porter seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@codecov
Copy link

codecov bot commented Apr 28, 2022

Codecov Report

Merging #754 (819a127) into master (d32709d) will increase coverage by 0.11%.
The diff coverage is 94.66%.

@@            Coverage Diff             @@
##           master     #754      +/-   ##
==========================================
+ Coverage   86.27%   86.38%   +0.11%     
==========================================
  Files          85       86       +1     
  Lines        5084     5141      +57     
  Branches      787      793       +6     
==========================================
+ Hits         4386     4441      +55     
  Misses        559      559              
- Partials      139      141       +2     
Impacted Files Coverage Δ
petastorm/etl/metadata_util.py 0.00% <0.00%> (ø)
petastorm/etl/rowgroup_indexing.py 64.51% <66.66%> (ø)
petastorm/pyarrow_helpers/dataset_wrapper.py 96.42% <96.42%> (ø)
petastorm/arrow_reader_worker.py 90.90% <100.00%> (ø)
petastorm/etl/dataset_metadata.py 87.41% <100.00%> (+0.08%) ⬆️
petastorm/etl/petastorm_generate_metadata.py 84.21% <100.00%> (ø)
petastorm/py_dict_reader_worker.py 95.45% <100.00%> (ø)
petastorm/reader.py 90.69% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d32709d...819a127. Read the comment docs.

Copy link
Collaborator

@selitvin selitvin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Do you think we can keep it in a PR for now until we have a followup that builds on this PR with the actual new version of a dataset being used?

from pyarrow import parquet as pq


class PetastormPyArrowDataset:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment explaining why do we need the wrapper.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good feature

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants