Missing values support is not consistent #770
Comments
For
Line 860 in fade200
This in turn finds null values by comparing Note using |
for |
IMHO, we should strive to not error by default on missing values
|
For Line 304 in fade200
if it is |
for |
For `deduplicate`: it performs no special handling of missing values, so the call to `np.unique` on the first line fails whenever any are present.
Actually my comment above does not apply to deduplicate
|
> Actually my comment above does not apply to deduplicate
why not? couldn't we deduplicate the other non-missing strings and leave the missing values missing?
|
> > Actually my comment above does not apply to deduplicate
>
> why not? couldn't we deduplicate the other non-missing strings and leave the missing values missing?
Actually, given that we are matching only on one column, it does make sense indeed. So agreed with your proposal
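For illustration, here is a minimal sketch of what "deduplicate the non-missing strings and leave the missing values missing" could look like at the `np.unique` step. The helper name `unique_non_missing` is hypothetical and not part of the library; it only assumes that missing entries are `None` or `np.nan` and uses `pd.isna` to find them.

```python
import numpy as np
import pandas as pd


def unique_non_missing(values):
    """Hypothetical helper: unique non-missing strings plus a mask of missing entries."""
    # Keep an object array so None and np.nan survive the conversion untouched.
    arr = np.asarray(values, dtype=object)
    # pd.isna flags both None and np.nan in an object array.
    missing = pd.isna(arr)
    # Only the non-missing entries reach np.unique, so there is no mixed-type sort.
    uniques = np.unique(arr[~missing].astype(str))
    return uniques, missing


values = ["sales", None, "sale", np.nan, "sales"]
uniques, missing = unique_non_missing(values)
```

The mask can then be used to write the deduplicated strings back into the non-missing positions while the missing positions stay untouched.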
|
`GapEncoder` and `deduplicate` raise different errors when dealing with `None` or `np.nan` values, unlike `MinHashEncoder` and `SimilarityEncoder`, which run successfully.

Interestingly, the errors differ depending on whether the column to encode is of high cardinality, like "department", or of low cardinality/binary, like "gender", from the employee dataset.

In the table below, we replace values in the columns "department" and "gender" with either `np.nan` or `None` values, e.g. "department" with `None` corresponds to: