GH_15729: implement multi-thread as-data-frame using datatable [nocheck] #15771

wendycwong · 2023-09-18T21:53:12Z

This PR resolves the issue here: #15729.

Basically, I use python datatable which is controlled by H2O-3 to perform the multi-thread conversion of H2O-3 frame to pandas frame.

In particulare, the column types are specified in the datatable.fread to make sure the data types are not changed.

tomasfryda · 2023-09-19T12:51:59Z

h2o-py/tests/testdir_misc/pyunit_gh_15729_datatable_2_pandas.py

+def test_frame_conversion(dataset, compareTime):
+    print("Performing data conversion to pandas for dataset: {0}".format(dataset))
+    h2oFrame = h2o.import_file(pyunit_utils.locate(dataset))
+     # convert frame not using datatable
+    startT = time.time()
+    original_pandas_frame = h2oFrame.as_data_frame()
+    oldTime = time.time()-startT
+    print("H2O frame to Pandas frame conversion time: {0}".format(oldTime))
+    # convert frame using datatable
+    startT = time.time()
+    new_pandas_frame = h2oFrame.as_data_frame(multi_thread=True)
+    newTime = time.time()-startT
+    print("H2O frame to Pandas frame conversion time using datatable: {0}".format(newTime))
+    if compareTime: # disable for small dataset.  Multi-thread can be slower due to more overhead
+        assert newTime <= oldTime, " original frame conversion time: {0} should exceed new frame conversion time:" \
+                                   "{1} but is not.".format(oldTime, newTime)
+    # compare two frames column types                
+    new_types = new_pandas_frame.dtypes
+    old_types = original_pandas_frame.dtypes
+    ncol = h2oFrame.ncol
+
+    for ind in range(ncol):
+        assert new_types[ind] == old_types[ind], "Expected column types: {0}, actual column types: " \
+                                                 "{1}".format(old_types[ind], new_types[ind])


There seems to be a problem with missing values in non-numeric colums. I modified this test a bit:

new_types = new_pandas_frame.dtypes old_types = original_pandas_frame.dtypes ncol = h2oFrame.ncol - + print(dataset) for ind in range(ncol): + print(new_pandas_frame.columns[ind], new_types[ind], h2oFrame.types[h2oFrame.columns[ind]]) assert new_types[ind] == old_types[ind], "Expected column types: {0}, actual column types: " \ "{1}".format(old_types[ind], new_types[ind]) + + if new_types[ind] == "object": + diff = new_pandas_frame.iloc[:, ind] == original_pandas_frame.iloc[:, ind] + if not diff.all(): + print(original_pandas_frame.loc[~diff, new_pandas_frame.columns[ind]]) + print(new_pandas_frame.loc[~diff, new_pandas_frame.columns[ind]]) + print(original_pandas_frame.loc[~diff, new_pandas_frame.columns[ind]].isna()) + print(new_pandas_frame.loc[~diff, new_pandas_frame.columns[ind]].isna()) + assert diff.all() + else: + diff = (new_pandas_frame.iloc[:, ind] - original_pandas_frame.iloc[:, ind]).abs() + assert (diff.max() < 1e-10)

And found out the discrepancy. IIRC H2O exports missing values to csv as an empty string:

col1, col2, col3 1.618,,3.1415926535

This is read in pandas as if col2 is missing but it appears to me that in the datatable to pandas way the empty string is read in as an empty string and empty string is not a missing value in pandas.

original_pandas_frame ===================== cabin object enum 9 NaN 13 NaN 15 NaN 23 NaN 25 NaN ... 1304 NaN 1305 NaN 1306 NaN 1307 NaN 1308 NaN Name: cabin, Length: 1014, dtype: object new_pandas_frame ================ 9 13 15 23 25 .. 1304 1305 1306 1307 1308 Name: cabin, Length: 1014, dtype: object original_pandas_frame.isna() ============================ 9 True 13 True 15 True 23 True 25 True ... 1304 True 1305 True 1306 True 1307 True 1308 True Name: cabin, Length: 1014, dtype: bool new_pandas_frame.isna() ======================= 9 False 13 False 15 False 23 False 25 False ... 1304 False 1305 False 1306 False 1307 False 1308 False Name: cabin, Length: 1014, dtype: bool

I'm not sure if it's a big problem but we should at least document it somewhere (but I'd be in favor of a fix even if it should be something like new_frame.loc[h2o_frame[col].isna(), col] = pd.NA (pseudocode as we can't index pandas by h2o column) but there is probably a better way e.g. specifying NA strings ).

This problem needs to be fixed.

h2o-py/h2o/frame.py

tomasfryda

Looks good to me, thank you @wendycwong !

…led by H2O GH-15729: fix problem with missing values for string columns. GH-15729 fixed NAN issue for string columns Co-authored-by: Tomáš Frýda <[email protected]>

wendycwong requested review from mn-mikke and tomasfryda September 18, 2023 21:53

wendycwong force-pushed the wendy_GH_15729_as_data_frame_datatable branch from 4964eb6 to ecfc2c5 Compare September 18, 2023 21:56

tomasfryda reviewed Sep 19, 2023

View reviewed changes

wendycwong requested a review from tomasfryda September 19, 2023 17:55

tomasfryda reviewed Sep 20, 2023

View reviewed changes

h2o-py/h2o/frame.py Outdated Show resolved Hide resolved

tomasfryda approved these changes Sep 20, 2023

View reviewed changes

wendycwong linked an issue Sep 25, 2023 that may be closed by this pull request

Problem using as_data_frame from polar #15729

Closed

wendycwong force-pushed the wendy_GH_15729_as_data_frame_datatable branch 2 times, most recently from e6239fc to 8049030 Compare September 25, 2023 15:13

wendycwong changed the title ~~GH_15729: implement multi-thread as-data-frame using datatable~~ GH_15729: implement multi-thread as-data-frame using datatable [nocheck] Sep 25, 2023

wendycwong changed the title ~~GH_15729: implement multi-thread as-data-frame using datatable [nocheck]~~ GH_15729: implement multi-thread as-data-frame using datatable Sep 25, 2023

GH-15729 change as_data_frame using Python data_table that is control…

bd36f44

…led by H2O GH-15729: fix problem with missing values for string columns. GH-15729 fixed NAN issue for string columns Co-authored-by: Tomáš Frýda <[email protected]>

wendycwong force-pushed the wendy_GH_15729_as_data_frame_datatable branch from 8049030 to bd36f44 Compare September 25, 2023 22:26

wendycwong changed the title ~~GH_15729: implement multi-thread as-data-frame using datatable~~ GH_15729: implement multi-thread as-data-frame using datatable [nocheck] Sep 25, 2023

wendycwong merged commit 3c41e85 into rel-3.42.0 Sep 26, 2023
2 checks passed

wendycwong deleted the wendy_GH_15729_as_data_frame_datatable branch September 26, 2023 20:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH_15729: implement multi-thread as-data-frame using datatable [nocheck] #15771

GH_15729: implement multi-thread as-data-frame using datatable [nocheck] #15771

wendycwong commented Sep 18, 2023

tomasfryda Sep 19, 2023 •

edited

Loading

wendycwong Sep 19, 2023

tomasfryda left a comment

GH_15729: implement multi-thread as-data-frame using datatable [nocheck] #15771

GH_15729: implement multi-thread as-data-frame using datatable [nocheck] #15771

Conversation

wendycwong commented Sep 18, 2023

tomasfryda Sep 19, 2023 • edited Loading

Choose a reason for hiding this comment

wendycwong Sep 19, 2023

Choose a reason for hiding this comment

tomasfryda left a comment

Choose a reason for hiding this comment

tomasfryda Sep 19, 2023 •

edited

Loading