-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH_15729: implement multi-thread as-data-frame using datatable [nocheck] #15771
Conversation
4964eb6
to
ecfc2c5
Compare
def test_frame_conversion(dataset, compareTime): | ||
print("Performing data conversion to pandas for dataset: {0}".format(dataset)) | ||
h2oFrame = h2o.import_file(pyunit_utils.locate(dataset)) | ||
# convert frame not using datatable | ||
startT = time.time() | ||
original_pandas_frame = h2oFrame.as_data_frame() | ||
oldTime = time.time()-startT | ||
print("H2O frame to Pandas frame conversion time: {0}".format(oldTime)) | ||
# convert frame using datatable | ||
startT = time.time() | ||
new_pandas_frame = h2oFrame.as_data_frame(multi_thread=True) | ||
newTime = time.time()-startT | ||
print("H2O frame to Pandas frame conversion time using datatable: {0}".format(newTime)) | ||
if compareTime: # disable for small dataset. Multi-thread can be slower due to more overhead | ||
assert newTime <= oldTime, " original frame conversion time: {0} should exceed new frame conversion time:" \ | ||
"{1} but is not.".format(oldTime, newTime) | ||
# compare two frames column types | ||
new_types = new_pandas_frame.dtypes | ||
old_types = original_pandas_frame.dtypes | ||
ncol = h2oFrame.ncol | ||
|
||
for ind in range(ncol): | ||
assert new_types[ind] == old_types[ind], "Expected column types: {0}, actual column types: " \ | ||
"{1}".format(old_types[ind], new_types[ind]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be a problem with missing values in non-numeric colums. I modified this test a bit:
new_types = new_pandas_frame.dtypes
old_types = original_pandas_frame.dtypes
ncol = h2oFrame.ncol
-
+ print(dataset)
for ind in range(ncol):
+ print(new_pandas_frame.columns[ind], new_types[ind], h2oFrame.types[h2oFrame.columns[ind]])
assert new_types[ind] == old_types[ind], "Expected column types: {0}, actual column types: " \
"{1}".format(old_types[ind], new_types[ind])
+
+ if new_types[ind] == "object":
+ diff = new_pandas_frame.iloc[:, ind] == original_pandas_frame.iloc[:, ind]
+ if not diff.all():
+ print(original_pandas_frame.loc[~diff, new_pandas_frame.columns[ind]])
+ print(new_pandas_frame.loc[~diff, new_pandas_frame.columns[ind]])
+ print(original_pandas_frame.loc[~diff, new_pandas_frame.columns[ind]].isna())
+ print(new_pandas_frame.loc[~diff, new_pandas_frame.columns[ind]].isna())
+ assert diff.all()
+ else:
+ diff = (new_pandas_frame.iloc[:, ind] - original_pandas_frame.iloc[:, ind]).abs()
+ assert (diff.max() < 1e-10)
And found out the discrepancy. IIRC H2O exports missing values to csv as an empty string:
col1, col2, col3
1.618,,3.1415926535
This is read in pandas as if col2 is missing but it appears to me that in the datatable to pandas way the empty string is read in as an empty string and empty string is not a missing value in pandas.
original_pandas_frame
=====================
cabin object enum
9 NaN
13 NaN
15 NaN
23 NaN
25 NaN
...
1304 NaN
1305 NaN
1306 NaN
1307 NaN
1308 NaN
Name: cabin, Length: 1014, dtype: object
new_pandas_frame
================
9
13
15
23
25
..
1304
1305
1306
1307
1308
Name: cabin, Length: 1014, dtype: object
original_pandas_frame.isna()
============================
9 True
13 True
15 True
23 True
25 True
...
1304 True
1305 True
1306 True
1307 True
1308 True
Name: cabin, Length: 1014, dtype: bool
new_pandas_frame.isna()
=======================
9 False
13 False
15 False
23 False
25 False
...
1304 False
1305 False
1306 False
1307 False
1308 False
Name: cabin, Length: 1014, dtype: bool
I'm not sure if it's a big problem but we should at least document it somewhere (but I'd be in favor of a fix even if it should be something like new_frame.loc[h2o_frame[col].isna(), col] = pd.NA
(pseudocode as we can't index pandas by h2o column) but there is probably a better way e.g. specifying NA strings ).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This problem needs to be fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thank you @wendycwong !
e6239fc
to
8049030
Compare
…led by H2O GH-15729: fix problem with missing values for string columns. GH-15729 fixed NAN issue for string columns Co-authored-by: Tomáš Frýda <[email protected]>
8049030
to
bd36f44
Compare
This PR resolves the issue here: #15729.
Basically, I use python datatable which is controlled by H2O-3 to perform the multi-thread conversion of H2O-3 frame to pandas frame.
In particulare, the column types are specified in the datatable.fread to make sure the data types are not changed.