What happened:
dask_ml.impute.SimpleImputer.fit with the median and most_frequent strategies computes different results on DataFrames than scikit-learn does.
What you expected to happen:
The results should be consistent with sklearn.impute.SimpleImputer.
Minimal Complete Verifiable Example:
import numpy as np
import pandas as pd
import dask.dataframe as dd
import dask_ml.impute
import sklearn.impute

df = pd.DataFrame({"A": [1, 1, np.nan, np.nan, 2, 2]})
# 1 and 2 are tied as the most frequent value; this should return the smallest one
b = dask_ml.impute.SimpleImputer(strategy="most_frequent", fill_value=None)
b.fit(df)
b.statistics_
# A    2.0
# dtype: float64
c = sklearn.impute.SimpleImputer(strategy="most_frequent", fill_value=None)
c.fit(df)
c.statistics_
# array([1.])
With median:
df = pd.DataFrame({"A": [1, 1, np.nan, np.nan, 2, 2]})
# Split the frame into two partitions
df = dd.from_pandas(df, npartitions=2)
b = dask_ml.impute.SimpleImputer(strategy="median", fill_value=None)
b.fit(df)
b.statistics_
# A    1.0
# dtype: float64
c = sklearn.impute.SimpleImputer(strategy="median", fill_value=None)
c.fit(df)
c.statistics_
# array([1.5])
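For reference, the expected statistics can be checked without either library. The following is only a minimal sketch in plain Python, assuming the scikit-learn behaviour described above (missing values are ignored, and a tie between modes is resolved by taking the smallest value). The names column_a, observed, expected_most_frequent and expected_median are illustrative and do not come from dask-ml or scikit-learn.

```python
import math
from statistics import median, multimode

# Same column as in the examples above, with NaN marking missing values
column_a = [1.0, 1.0, math.nan, math.nan, 2.0, 2.0]

# Drop missing values before computing any statistic
observed = [v for v in column_a if not math.isnan(v)]

# multimode returns every value tied for the highest count;
# taking the minimum mirrors the "smallest value wins" tie-break
expected_most_frequent = min(multimode(observed))

# Exact median of [1.0, 1.0, 2.0, 2.0] is the mean of the two middle values
expected_median = median(observed)

print(expected_most_frequent)  # 1.0
print(expected_median)  # 1.5
```

Both values match the scikit-learn output (array([1.]) and array([1.5])) and differ from the dask-ml output shown above.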
Environment:
- Dask version: 2021.01.1
- Python version: 3.7.6
- Operating System: macOS
- Install method (conda, pip, source): pip