Missing values support is not consistent #770

Vincent-Maladiere · 2023-09-29T16:28:28Z

GapEncoder and deduplicate raise different errors when dealing with None or np.nan values, unlike MinHashEncoder and SimilarityEncoder which run successfully.

Interestingly, the errors differ depending on whether the column to encode has high cardinality, like "department", or is low-cardinality/binary, like "gender" (both from the employee salaries dataset).

In the table below, we replace values in the columns "department" and "gender" with either np.nan or None. For example, "department" with None corresponds to:

from skrub import GapEncoder
from skrub.datasets import fetch_employee_salaries

df = fetch_employee_salaries().X
df["department"].replace("POL", None, inplace=True)

GapEncoder().fit_transform(df[["department"]])
# AssertionError: Input data is not string.

| | "department" with np.nan | "department" with None | "gender" with np.nan | "gender" with None |
|---|---|---|---|---|
| GapEncoder | Success | AssertionError: Input data is not string | ValueError: empty vocabulary; perhaps the documents only contain stop words | TypeError: '<' not supported between instances of 'NoneType' and 'str' |
| deduplicate | TypeError: '<' not supported between instances of 'NoneType' and 'NoneType' | TypeError: '<' not supported between instances of 'NoneType' and 'NoneType' | TypeError: '<' not supported between instances of 'float' and 'str' | TypeError: '<' not supported between instances of 'NoneType' and 'str' |

jeromedockes · 2023-10-04T12:42:32Z

For GapEncoder and "department":

GapEncoder converts the input to a numpy array, then finds and handles missing values by calling sklearn.utils.fixes._object_dtype_isnan:

skrub/skrub/_gap_encoder.py

Line 860 in fade200

missing_mask = _object_dtype_isnan(X)

This in turn finds null values by comparing X != X.
np.nan != np.nan is True, but None != None is False, so this method does not detect None entries as missing values. They are therefore not imputed (replaced with ""), and the later check, which asserts that the first value in the series is a string, fails.

Note that using _object_dtype_isnan before extracting the dataframe values into a numpy array, or simply using pd.isnull or pd.isna, would correctly find the None entries.
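A minimal sketch of the difference, using a small hand-made object array rather than the actual GapEncoder input (the array contents are made up for illustration):

```python
import numpy as np
import pandas as pd

# Object-dtype array mixing strings, NaN and None, similar to what a
# string column with missing entries looks like once converted to numpy.
X = np.array(["POL", np.nan, None, "HHS"], dtype=object)

# Self-comparison: only NaN is unequal to itself, so None is not flagged.
self_comparison_mask = X != X

# pandas treats both NaN and None as missing.
pandas_mask = pd.isna(X)

print(self_comparison_mask.tolist())  # [False, True, False, False]
print(pandas_mask.tolist())           # [False, True, True, False]
```

The second mask is the one that would allow None entries to be imputed alongside np.nan.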

jeromedockes · 2023-10-04T12:50:04Z

For GapEncoder and "gender" with np.nan: this one is actually not related to missing values; you get the same error if you don't insert any missing values. The default n-gram range of the CountVectorizer starts at 2, so documents of length 1 result in 0 tokens, and the column contains only "F" and "M".
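The empty vocabulary can be reproduced with scikit-learn's CountVectorizer alone. This sketch assumes character n-grams with ngram_range=(2, 4); only the lower bound of 2 is stated above, so the analyzer and the upper bound are assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed settings: character n-grams of length 2 to 4.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 4))

# Single-character documents produce no n-gram of length >= 2.
try:
    vectorizer.fit(["F", "M", "F"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# Longer strings yield at least one 2-gram, so fitting succeeds.
vocabulary = (
    CountVectorizer(analyzer="char", ngram_range=(2, 4))
    .fit(["POL", "HHS"])
    .vocabulary_
)
print(sorted(vocabulary))
```

No missing value is involved here, which matches the observation that the error is independent of np.nan.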

GaelVaroquaux · 2023-10-04T12:57:35Z

IMHO, we should strive to not error by default on missing values

jeromedockes · 2023-10-04T12:58:44Z

For GapEncoder and "gender" with None: the behavior is actually the same as for the high-cardinality "department". What matters is whether the first (index 0) value is None, because the check only looks at the first value:

skrub/skrub/_gap_encoder.py

Line 304 in fade200

assert isinstance(X[0], str), "Input data is not string. "

If it is None, the assertion fails at this point.
If the None is elsewhere, the check passes, but later a call to np.unique fails in the CountVectorizer when it builds its vocabulary.
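To illustrate why the position of the None matters, here is a small sketch. `first_is_str` mirrors the quoted assertion; `all_are_str` is a hypothetical stricter check, not something that exists in skrub:

```python
import numpy as np


def first_is_str(X):
    # Mirrors the quoted assertion: only the first entry is inspected.
    return isinstance(X[0], str)


def all_are_str(X):
    # Hypothetical stricter check that inspects every entry.
    return all(isinstance(x, str) for x in X)


none_first = np.array([None, "F", "M"], dtype=object)
none_later = np.array(["F", None, "M"], dtype=object)

print(first_is_str(none_first))  # False: the assertion would fail here
print(first_is_str(none_later))  # True: the None goes unnoticed
print(all_are_str(none_later))   # False: a full scan would catch it
```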

jeromedockes · 2023-10-04T13:02:00Z

For deduplicate: deduplicate performs no special handling of missing values, so the call to np.unique on the first line fails whenever there are any.
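The np.unique failure can be reproduced without skrub. A minimal sketch on made-up values (`try_unique` is a helper written for this example only):

```python
import numpy as np


def try_unique(values):
    """Return the unique values, or the error text if np.unique raises."""
    try:
        return np.unique(np.array(values, dtype=object)).tolist()
    except TypeError as exc:
        return f"TypeError: {exc}"


print(try_unique(["F", "M", "F"]))     # ['F', 'M']
print(try_unique(["F", np.nan, "M"]))  # sorting compares a float with a str
print(try_unique(["F", None, "M"]))    # sorting compares None with a str
```

np.unique sorts its input, and neither float nor NoneType can be ordered against str, which is where the '<' not supported messages in the table come from.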

GaelVaroquaux · 2023-10-04T13:03:54Z

> For `deduplicate`: `deduplicate` performs no special handling of missing values, so the call to `np.unique` on the first line fails whenever there are any

Actually my comment above does not apply to deduplicate

jeromedockes · 2023-10-04T13:09:43Z

> Actually my comment above does not apply to deduplicate

why not? couldn't we deduplicate the other non-missing strings and leave the missing values missing?

GaelVaroquaux · 2023-10-04T13:17:44Z

> > Actually my comment above does not apply to deduplicate
>
> why not? couldn't we deduplicate the other non-missing strings and leave the missing values missing?

Actually, given that we are matching only on one column, it does make sense indeed. So agreed with your proposal
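One possible shape for this, as a sketch only: mask the missing entries, process the rest, and put the results back. `apply_to_non_missing` is a hypothetical helper, and the lowercase/strip function below is just a stand-in for the real deduplication step:

```python
import numpy as np
import pandas as pd


def apply_to_non_missing(values, func):
    """Apply func to the non-missing entries; leave missing ones untouched.

    func takes a list of strings and returns a list of the same length.
    """
    values = pd.Series(values, dtype=object)
    missing = values.isna()
    result = values.copy()
    result.loc[~missing] = func(values[~missing].tolist())
    return result


def normalize(strings):
    # Stand-in for the deduplication step, for illustration only.
    return [s.strip().lower() for s in strings]


out = apply_to_non_missing(["Police", None, "police ", np.nan], normalize)
print(out.tolist())
```

Both None and np.nan stay in place, and only the actual strings reach the processing function.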

Vincent-Maladiere added the bug ("Something isn't working") label Sep 29, 2023

jeromedockes mentioned this issue Oct 4, 2023

handling None in GapEncoder #779

Merged
