Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix type of the a batch returned by make_batch_reader when TransformSpec's function returns column with all values being None #750

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

selitvin
Copy link
Collaborator

@selitvin selitvin commented Apr 8, 2022

Resolves #744

We implemented a Unischema->Pyarrow-schema conversion and explicitly
set the pyarrow schema when converting a pandas dataframe returned by
transform spec function into a pyarrow table. This way, pyarrow does not
have to guess the type of data from the data itself (which it obviously
could not do before, since all values were None).

@selitvin selitvin changed the title Fix type of the a batch returned by make_batch_reader when TransformSpec's function sets an entire column to None. Fix type of the a batch returned by make_batch_reader when TransformSpec's function returns column with all values being None Apr 8, 2022
@selitvin selitvin closed this Apr 9, 2022
@selitvin selitvin reopened this Apr 12, 2022
@@ -209,6 +209,33 @@ def preproc_fn1(x):
(3, 4, 5) == sample['tensor_col_2'].shape[1:]


@pytest.mark.parametrize('null_column_dtype', [np.float64, np.unicode_])
@pytest.mark.parametrize('reader_factory', _D)
def test_transform_spec_returns_all_none_values(scalar_dataset, null_column_dtype, reader_factory):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These test cases assume the transform_spec func is creating the null values. In the more common case, there are missing values in fields unedited by the transform_spec. I believe this solution already addresses both cases but it would be good to demonstrate this in the tests either with an additional test case or additional non-edited fields in this test case.

@pytest.mark.parametrize('reader_factory', _D)
def test_transform_spec_returns_all_none_values_in_a_list_field(scalar_dataset, reader_factory):
def fill_id_with_nones(x):
return pd.DataFrame({'int_fixed_size_list': [[None for _ in range(3)]] * len(x)})

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also ran into the NoneType issue with lists of strings. Consider adding string types to the test as well.

The NoneType problem also occurs when only some of the values in the list are None, e.g. ['a', 'b', None]. What about a test_transform_spec_returns_some_none_values_in_a_list_field?

Yevgeni Litvin added 5 commits September 13, 2022 22:13
…rSpec's function sets an entire column to None.

Resolves uber#744

We implemented a Unischema->Pyarrow-schema conversion and explicitly
set the pyarrow schema when converting a pandas dataframe returned by
transform spec function into a pyarrow table. This way, pyarrow does not
have to guess the type of data from the data itself (which it obviously
could not do before, since all values were None).
The test tests properly columns with scalars only. Will need to verify
correct behavior with columns that are lists separately. Will extend the
tests in the following commits.
…all elements being None.

Change the type of numpy dtype to np.object instead of np.unicode_
@selitvin selitvin force-pushed the selitvin/fix_all_none_return_by_transformer branch from 5d28818 to 1fcf22f Compare September 14, 2022 05:58
@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Yevgeni Litvin seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use of transform_spec in make_batch_reader leads to tensorflow error when column is missing values
4 participants