Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH_15729: implement multi-thread as-data-frame using datatable [nocheck] #15771

Merged
merged 1 commit into from
Sep 26, 2023

Conversation

wendycwong
Copy link
Contributor

This PR resolves the issue here: #15729.

Basically, I use python datatable which is controlled by H2O-3 to perform the multi-thread conversion of H2O-3 frame to pandas frame.

In particulare, the column types are specified in the datatable.fread to make sure the data types are not changed.

@wendycwong wendycwong force-pushed the wendy_GH_15729_as_data_frame_datatable branch from 4964eb6 to ecfc2c5 Compare September 18, 2023 21:56
Comment on lines 10 to 36
def test_frame_conversion(dataset, compareTime):
print("Performing data conversion to pandas for dataset: {0}".format(dataset))
h2oFrame = h2o.import_file(pyunit_utils.locate(dataset))
# convert frame not using datatable
startT = time.time()
original_pandas_frame = h2oFrame.as_data_frame()
oldTime = time.time()-startT
print("H2O frame to Pandas frame conversion time: {0}".format(oldTime))
# convert frame using datatable
startT = time.time()
new_pandas_frame = h2oFrame.as_data_frame(multi_thread=True)
newTime = time.time()-startT
print("H2O frame to Pandas frame conversion time using datatable: {0}".format(newTime))
if compareTime: # disable for small dataset. Multi-thread can be slower due to more overhead
assert newTime <= oldTime, " original frame conversion time: {0} should exceed new frame conversion time:" \
"{1} but is not.".format(oldTime, newTime)
# compare two frames column types
new_types = new_pandas_frame.dtypes
old_types = original_pandas_frame.dtypes
ncol = h2oFrame.ncol

for ind in range(ncol):
assert new_types[ind] == old_types[ind], "Expected column types: {0}, actual column types: " \
"{1}".format(old_types[ind], new_types[ind])
Copy link
Contributor

@tomasfryda tomasfryda Sep 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be a problem with missing values in non-numeric colums. I modified this test a bit:

     new_types = new_pandas_frame.dtypes
     old_types = original_pandas_frame.dtypes
     ncol = h2oFrame.ncol
-    
+    print(dataset)
     for ind in range(ncol):
+        print(new_pandas_frame.columns[ind], new_types[ind], h2oFrame.types[h2oFrame.columns[ind]])
         assert new_types[ind] == old_types[ind], "Expected column types: {0}, actual column types: " \
                                                  "{1}".format(old_types[ind], new_types[ind])
+        
+        if new_types[ind] == "object":
+            diff = new_pandas_frame.iloc[:, ind] == original_pandas_frame.iloc[:, ind]
+            if not diff.all():
+                print(original_pandas_frame.loc[~diff, new_pandas_frame.columns[ind]])
+                print(new_pandas_frame.loc[~diff, new_pandas_frame.columns[ind]])
+                print(original_pandas_frame.loc[~diff, new_pandas_frame.columns[ind]].isna())
+                print(new_pandas_frame.loc[~diff, new_pandas_frame.columns[ind]].isna())
+            assert diff.all()
+        else:
+            diff = (new_pandas_frame.iloc[:, ind] - original_pandas_frame.iloc[:, ind]).abs()
+            assert (diff.max() < 1e-10)

And found out the discrepancy. IIRC H2O exports missing values to csv as an empty string:

col1, col2, col3
1.618,,3.1415926535

This is read in pandas as if col2 is missing but it appears to me that in the datatable to pandas way the empty string is read in as an empty string and empty string is not a missing value in pandas.

original_pandas_frame
=====================
cabin object enum
9       NaN
13      NaN
15      NaN
23      NaN
25      NaN
       ... 
1304    NaN
1305    NaN
1306    NaN
1307    NaN
1308    NaN
Name: cabin, Length: 1014, dtype: object

new_pandas_frame
================
9        
13       
15       
23       
25       
       ..
1304     
1305     
1306     
1307     
1308     
Name: cabin, Length: 1014, dtype: object

original_pandas_frame.isna()
============================
9       True
13      True
15      True
23      True
25      True
        ... 
1304    True
1305    True
1306    True
1307    True
1308    True
Name: cabin, Length: 1014, dtype: bool

new_pandas_frame.isna()
=======================
9       False
13      False
15      False
23      False
25      False
        ...  
1304    False
1305    False
1306    False
1307    False
1308    False
Name: cabin, Length: 1014, dtype: bool

I'm not sure if it's a big problem but we should at least document it somewhere (but I'd be in favor of a fix even if it should be something like new_frame.loc[h2o_frame[col].isna(), col] = pd.NA (pseudocode as we can't index pandas by h2o column) but there is probably a better way e.g. specifying NA strings ).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This problem needs to be fixed.

h2o-py/h2o/frame.py Outdated Show resolved Hide resolved
Copy link
Contributor

@tomasfryda tomasfryda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you @wendycwong !

@wendycwong wendycwong linked an issue Sep 25, 2023 that may be closed by this pull request
@wendycwong wendycwong force-pushed the wendy_GH_15729_as_data_frame_datatable branch 2 times, most recently from e6239fc to 8049030 Compare September 25, 2023 15:13
@wendycwong wendycwong changed the title GH_15729: implement multi-thread as-data-frame using datatable GH_15729: implement multi-thread as-data-frame using datatable [nocheck] Sep 25, 2023
@wendycwong wendycwong changed the title GH_15729: implement multi-thread as-data-frame using datatable [nocheck] GH_15729: implement multi-thread as-data-frame using datatable Sep 25, 2023
…led by H2O

GH-15729: fix problem with missing values for string columns.
GH-15729  fixed NAN issue for string columns
Co-authored-by: Tomáš Frýda <[email protected]>
@wendycwong wendycwong force-pushed the wendy_GH_15729_as_data_frame_datatable branch from 8049030 to bd36f44 Compare September 25, 2023 22:26
@wendycwong wendycwong changed the title GH_15729: implement multi-thread as-data-frame using datatable GH_15729: implement multi-thread as-data-frame using datatable [nocheck] Sep 25, 2023
@wendycwong wendycwong merged commit 3c41e85 into rel-3.42.0 Sep 26, 2023
2 checks passed
@wendycwong wendycwong deleted the wendy_GH_15729_as_data_frame_datatable branch September 26, 2023 20:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Problem using as_data_frame from polar
2 participants