Use of transform_spec in make_batch_reader leads to tensorflow error when column is missing values #744
Comments
…rSpec's function sets an entire column to None. Resolves uber#744 We implemented a Unischema->Pyarrow-schema conversion and explicitly set the pyarrow schema when converting a pandas dataframe returned by transform spec function into a pyarrow table. This way, pyarrow does not have to guess the type of data from the data itself (which it obviously could not do before, since all values were None).
Thanks for bringing up the issue. I tried forcing a strict schema type when converting back from pandas to a pyarrow table here. Unfortunately, the "trip" to pandas and back is not transparent. One type that ended up being tricky is pa.timestamp. I was not able to make it work without implementing some weird conversion code, and I am not sure it would be robust enough. Another approach I tried is to pass a pyarrow.Table as the argument to the TransformSpec function (instead of a pandas dataframe). However, working with pyarrow.Table in the transform spec function is inconvenient, since pa.Table is immutable and the pandas API is much more convenient for implementing a transformation. So after doing all this, I would suggest sticking with the current implementation. While it's not perfect, I was not able to find a better alternative that would not require implementing potentially non-robust code. Would appreciate your thoughts and suggestions on this matter.
Thanks for looking into this! What was the issue with pa.timestamp? I'm not seeing the timestamp-specific conversion code in #750. For our purposes, using the proposed workaround is not a big deal as we only use a single |
The issue with timestamps I ran into was the automatic conversion of the timestamp into a datetime object - it would not be automatically converted back into pa.timestamp64. However, I just noticed that there is a |
The conversion back to arrow from pandas in ArrowReaderWorker._load_rows() loses type information when all rows in the loaded row group are missing values for a given column. From the pyarrow.Table.from_pandas documentation:

The result is the following error when reading the corresponding batch in the TensorFlow Dataset:

Example

Workaround
Modify all TransformSpec funcs to replace string columns missing all values with None strings.

Full Trace