Missing values support is not consistent #770

Open
Vincent-Maladiere opened this issue Sep 29, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@Vincent-Maladiere
Member

GapEncoder and deduplicate raise different errors when dealing with None or np.nan values, unlike MinHashEncoder and SimilarityEncoder which run successfully.

Interestingly, the errors differ depending on whether the column to encode is high cardinality, like "department", or low cardinality/binary, like "gender" (both from the employee salaries dataset).

In the table below, we replace values in the columns "department" and "gender" with either np.nan or None. For example, "department" with None corresponds to:

from skrub import GapEncoder
from skrub.datasets import fetch_employee_salaries

df = fetch_employee_salaries().X
df["department"].replace("POL", None, inplace=True)

GapEncoder().fit_transform(df[["department"]])
# AssertionError: Input data is not string.

| | "department" with np.nan | "department" with None | "gender" with np.nan | "gender" with None |
|---|---|---|---|---|
| GapEncoder | Success | AssertionError: Input data is not string | ValueError: empty vocabulary; perhaps the documents only contain stop words | TypeError: '<' not supported between instances of 'NoneType' and 'str' |
| deduplicate | TypeError: '<' not supported between instances of 'NoneType' and 'NoneType' | TypeError: '<' not supported between instances of 'NoneType' and 'NoneType' | TypeError: '<' not supported between instances of 'float' and 'str' | TypeError: '<' not supported between instances of 'NoneType' and 'str' |

@Vincent-Maladiere Vincent-Maladiere added the bug Something isn't working label Sep 29, 2023
@jeromedockes
Member

For GapEncoder and "department":

GapEncoder converts to numpy array, then finds and handles missing values by calling sklearn.utils.fixes._object_dtype_isnan

missing_mask = _object_dtype_isnan(X)

This in turn finds null values by comparing X != X.
np.nan != np.nan is True, but None != None is False, which is why this method does not detect None entries as missing values. They are therefore not imputed (replaced with ""), and the later check, which asserts that the first value in the series is a string, fails.

Note that using _object_dtype_isnan before extracting the dataframe values into a numpy array, or simply using pd.isnull or pd.isna, would correctly find the None entries.
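
The difference between the two detection strategies can be checked on a small object array. This is only a sketch: the `values != values` line mimics the self-comparison described above instead of calling scikit-learn's private helper, and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample: one np.nan and one None among ordinary strings.
values = np.array(["POL", np.nan, None, "HHS"], dtype=object)

# Self-comparison: only np.nan is unequal to itself, so None is missed.
mask_self_compare = values != values

# pandas null detection treats both np.nan and None as missing.
mask_pandas = pd.isna(values)

print(mask_self_compare.tolist())
print(mask_pandas.tolist())
```

With the pandas mask, both kinds of missing entries could be imputed before the string check runs.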

@jeromedockes
Member

For GapEncoder and "gender" with np.nan: this one is actually not related to missing values; you get the same error if you don't insert any. The default n-gram range of the CountVectorizer starts at 2, so documents of length 1 produce 0 tokens, and the column contains only "F" and "M".
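
To see why single-character values yield no tokens, here is a simplified pure-Python sketch of character n-gram extraction. `char_ngrams` is a hypothetical helper, not scikit-learn's implementation, and the upper bound of 4 is an assumption; only the lower bound of 2 comes from the explanation above.

```python
def char_ngrams(text, min_n=2, max_n=4):
    """Hypothetical helper: all character n-grams with min_n <= n <= max_n."""
    return [
        text[i : i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(text) - n + 1)
    ]


# Every "gender" value is a single character, so no n-gram of length >= 2 fits.
tokens_gender = [char_ngrams(doc) for doc in ["F", "M"]]

# A longer value such as "POL" does produce tokens.
tokens_department = char_ngrams("POL")

print(tokens_gender)
print(tokens_department)
```

If no document produces a token, the vocabulary is empty, which matches the ValueError reported for "gender".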

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 4, 2023 via email

@jeromedockes
Member

For GapEncoder and "gender" with None: the behavior is actually the same as for the high-cardinality "department". What matters is whether the first (index 0) value is None or not, because the check only looks at the first value:

assert isinstance(X[0], str), "Input data is not string. "

If the first value is None, it fails at this point.
If the None is elsewhere, the check passes, but a later call to np.unique fails in the CountVectorizer when it builds its vocabulary.
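
Both situations can be reproduced outside of GapEncoder with a small object array. This is a sketch with made-up values; it calls np.unique directly rather than going through the CountVectorizer.

```python
import numpy as np

# Made-up column where the None is not in first position.
X = np.array(["F", None, "M"], dtype=object)

# A check on the first element alone passes...
first_is_str = isinstance(X[0], str)

# ...whereas checking every element reveals the None.
all_are_str = all(isinstance(value, str) for value in X)

# np.unique sorts its input, and None cannot be ordered against str.
try:
    np.unique(X)
    unique_error = None
except TypeError as exc:
    unique_error = exc

print(first_is_str, all_are_str, type(unique_error).__name__)
```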

@jeromedockes
Member

For deduplicate: deduplicate performs no special handling of missing values, so the call to np.unique on the first line fails whenever there are any.
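
One possible way around this, shown only as a sketch and not as what deduplicate does or should do, is to mask out missing entries before calling np.unique. The sample values are made up.

```python
import numpy as np
import pandas as pd

# Made-up input containing both kinds of missing values.
names = np.array(["Paris", None, "Londn", np.nan, "Paris"], dtype=object)

# Keep only the entries that pandas does not consider missing.
present = ~pd.isna(names)

# With only strings left, sorting inside np.unique works.
unique_names = np.unique(names[present])

print(unique_names.tolist())
```

How the masked-out rows should then be treated in the output (left missing, or mapped to a placeholder) is a separate design decision.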

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 4, 2023 via email

@jeromedockes
Member

jeromedockes commented Oct 4, 2023 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 4, 2023 via email

3 participants