
Parallelize encoding of a single row #546

Status: Draft — wants to merge 1 commit into base: master
Conversation

selitvin (Collaborator) commented:
When writing data into a petastorm dataset, before a pyspark sql.Row
object is created, fields containing data that is not natively supported
by the Parquet format, such as numpy arrays, are serialized into byte
arrays. Images may be compressed using png or jpeg compression.

Serializing fields on a thread pool speeds up this process in some
cases (e.g. a row contains multiple images).
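To illustrate the per-field serialization the description refers to, here is a minimal, dependency-free sketch of turning a numpy array field into a compressed byte string and back. Petastorm's actual codecs (e.g. its png/jpeg image codecs) differ; `encode_field`/`decode_field` and the use of zlib are illustrative stand-ins, not petastorm APIs.

```python
import io
import zlib

import numpy as np


def encode_field(arr: np.ndarray) -> bytes:
    """Serialize a numpy array into a compressed byte string.

    zlib stands in here for the image codecs (png/jpeg) petastorm
    would use; the point is only that each field becomes bytes that
    Parquet can store natively.
    """
    buf = io.BytesIO()
    np.save(buf, arr)                      # portable numpy serialization
    return zlib.compress(buf.getvalue())   # compress the raw bytes


def decode_field(data: bytes) -> np.ndarray:
    """Inverse of encode_field: decompress and reload the array."""
    return np.load(io.BytesIO(zlib.decompress(data)))


# Round-trip a small "image" field.
image = np.zeros((128, 128, 3), dtype=np.uint8)
roundtrip = decode_field(encode_field(image))
assert np.array_equal(image, roundtrip)
```

Encoding like this is CPU-bound, which is why spreading it across a thread pool can pay off when a row contains several large fields.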

@selitvin selitvin marked this pull request as draft April 20, 2020 05:28
Fields that contain data that is not natively supported
by the Parquet format, such as numpy arrays, are serialized into byte
arrays. Images may be compressed using png or jpeg compression.

Serializing fields on a thread pool speeds up this process in some
cases (e.g. a row contains multiple images).

This PR adds a pool executor argument to `dict_to_spark_row`, enabling
the user to pass a pool executor that will be used to parallelize
this serialization. If no pool executor is specified, the
encoding/serialization is performed on the caller thread.
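The optional-executor pattern described above can be sketched as follows. This is not the PR's actual implementation of `dict_to_spark_row`; `encode_row`, `encoders`, and the `pool_executor` parameter name are illustrative, showing only the dispatch logic: encode fields concurrently when an executor is supplied, otherwise on the caller thread.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def encode_row(row_dict, encoders, pool_executor=None):
    """Encode every field of a row with its per-field encoder.

    If pool_executor is given, field encoders run concurrently on it;
    otherwise they run sequentially on the caller thread. This mirrors
    the optional executor argument the PR adds (names are illustrative).
    """
    if pool_executor is None:
        # Caller-thread path: encode fields one after another.
        return {name: encoders[name](value)
                for name, value in row_dict.items()}
    # Parallel path: submit one encoding task per field, then collect.
    futures = {name: pool_executor.submit(encoders[name], value)
               for name, value in row_dict.items()}
    return {name: fut.result() for name, fut in futures.items()}


# Example row with two compressible "image" fields and a scalar label.
encoders = {
    'image_a': zlib.compress,
    'image_b': zlib.compress,
    'label': lambda x: x,
}
row = {'image_a': b'\x00' * 4096, 'image_b': b'\xff' * 4096, 'label': 7}

serial = encode_row(row, encoders)
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = encode_row(row, encoders, pool_executor=pool)
assert serial == parallel  # same result either way
```

Because the heavy lifting in codecs like zlib, png, and jpeg happens in C code that releases the GIL, a thread pool (rather than a process pool) is enough to get real parallelism here.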
codecov bot commented Apr 20, 2020

Codecov Report

Base: 82.88% // Head: 85.99% // This PR increases project coverage by +3.11% 🎉

Coverage data is based on head (3fe68d4) compared to base (83a02df).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #546      +/-   ##
==========================================
+ Coverage   82.88%   85.99%   +3.11%     
==========================================
  Files          85       87       +2     
  Lines        4721     4935     +214     
  Branches      744      783      +39     
==========================================
+ Hits         3913     4244     +331     
+ Misses        678      568     -110     
+ Partials      130      123       -7     
Impacted Files Coverage Δ
petastorm/unischema.py 96.91% <100.00%> (+1.12%) ⬆️
petastorm/reader_impl/pytorch_shuffling_buffer.py 96.42% <0.00%> (ø)
petastorm/benchmark/dummy_reader.py 0.00% <0.00%> (ø)
petastorm/py_dict_reader_worker.py 95.23% <0.00%> (+0.79%) ⬆️
petastorm/spark/spark_dataset_converter.py 91.76% <0.00%> (+1.49%) ⬆️
petastorm/pytorch.py 94.21% <0.00%> (+1.53%) ⬆️
petastorm/arrow_reader_worker.py 92.00% <0.00%> (+2.00%) ⬆️
petastorm/compat.py 100.00% <0.00%> (+39.02%) ⬆️
..._dataset_converter/tests/test_converter_example.py 100.00% <0.00%> (+46.66%) ⬆️
examples/spark_dataset_converter/utils.py 100.00% <0.00%> (+62.50%) ⬆️
... and 2 more


☔ View full report at Codecov.

CLAassistant commented (CLA assistant check):
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Yevgeni Litvin seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.
