This repository has been archived by the owner on Dec 4, 2019. It is now read-only.

Update to latest scikit-learn release for deprecation and compatibility #53

Open
dsackin opened this issue Jun 6, 2017 · 12 comments

Comments

@dsackin

dsackin commented Jun 6, 2017

Using the current head 0.2.0 release of spark-sklearn and the current release of scikit-learn (0.18.1), I'm getting the following deprecation warning:

/.../python3.4/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)

The library needs to be updated to use the new model_selection module and its iterator interfaces.
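One way a library could bridge the two module layouts during the transition is a small import shim. The sketch below is an illustration only, not how spark-sklearn handles it: `import_first` and `load_cv_module` are hypothetical helper names, and the only scikit-learn facts assumed are the two module paths named in the warning above.

```python
import importlib


def import_first(module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Module missing in this environment; try the next candidate.
            continue
    raise ImportError("none of these modules could be imported: %r" % (module_names,))


def load_cv_module():
    """Hypothetical helper: prefer the new module, fall back to the deprecated one."""
    return import_first(["sklearn.model_selection", "sklearn.cross_validation"])


# The CV iterator interface also changed between the two modules, so a shim
# like this does not make calling code portable by itself:
#   old (cross_validation):  for train, test in KFold(n, n_folds=3): ...
#   new (model_selection):   for train, test in KFold(n_splits=3).split(X): ...
```

Because the iterator constructors differ, code that has to run against both layouts would still need to branch on which module was loaded.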

In addition, due to changes in sklearn.model_selection.GridSearchCV, the attributes available on the fitted spark-sklearn.GridSearchCV are out of date.

sklearn.model_selection.GridSearchCV now has:

  • cv_results_ : dict of numpy (masked) ndarrays - A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame.
  • best_estimator_ : estimator - Estimator that was chosen by the search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data. Not available if refit=False.
  • best_score_ : float - Score of best_estimator on the left out data.
  • best_params_ : dict - Parameter setting that gave the best results on the hold out data.
  • best_index_ : int - The index (of the cv_results_ arrays) which corresponds to the best candidate parameter setting.
  • scorer_ : function - Scorer function used on the held out data to choose the best parameters for the model.
  • n_splits_ : int - The number of cross-validation splits (folds/iterations).

While spark-sklearn.GridSearchCV has:

  • grid_scores_ : list of named tuples
  • best_estimator_ : estimator - Estimator that was chosen by the search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data. Not available if refit=False.
  • best_score_ : float - Score of best_estimator on the left out data.
  • best_params_ : dict - Parameter setting that gave the best results on the hold out data.
  • scorer_ : function - Scorer function used on the held out data to choose the best parameters for the model.

The most critical difference is that sklearn added the more comprehensive cv_results_ which adds data that the formerly compatible grid_scores_ is lacking.
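To make the gap concrete, here is a rough sketch of how the old `grid_scores_` entries map onto a subset of the `cv_results_` columns. It is not taken from either library: `GridScore` is a stand-in for the named tuples in `grid_scores_`, the function names are hypothetical, plain lists replace the numpy arrays, and the population standard deviation is an assumption about how the spread would be computed. Columns that `grid_scores_` never recorded (for example timing data) cannot be reconstructed this way.

```python
from collections import namedtuple
from statistics import pstdev

# Stand-in for the named tuples stored in the old grid_scores_ attribute.
GridScore = namedtuple(
    "GridScore", ["parameters", "mean_validation_score", "cv_validation_scores"]
)


def grid_scores_to_cv_results(grid_scores):
    """Build a cv_results_-style dict of columns from grid_scores_-style tuples."""
    means = [gs.mean_validation_score for gs in grid_scores]
    results = {
        "params": [gs.parameters for gs in grid_scores],
        "mean_test_score": means,
        # Assumption: spread across folds as a population standard deviation.
        "std_test_score": [pstdev(gs.cv_validation_scores) for gs in grid_scores],
        # Rank 1 is the highest mean score; tied candidates share the lowest rank.
        "rank_test_score": [1 + sum(other > m for other in means) for m in means],
    }
    n_splits = len(grid_scores[0].cv_validation_scores) if grid_scores else 0
    for i in range(n_splits):
        # One column per fold, mirroring the split<i>_test_score keys.
        results["split%d_test_score" % i] = [
            gs.cv_validation_scores[i] for gs in grid_scores
        ]
    return results


def best_index(cv_results):
    """Index of the first best-ranked candidate, analogous to best_index_."""
    return cv_results["rank_test_score"].index(1)


scores = [
    GridScore({"C": 1}, 0.8, [0.7, 0.9]),
    GridScore({"C": 10}, 0.9, [0.85, 0.95]),
]
results = grid_scores_to_cv_results(scores)
print(results["params"][best_index(results)])
```

The column-oriented layout is what lets `cv_results_` be loaded into a pandas DataFrame directly, which the row-oriented `grid_scores_` list does not offer.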

@ajaysaini725
Contributor

I've been working on this and almost have a PR ready. It will be out this upcoming Monday.

@dsackin
Author

dsackin commented Jun 20, 2017

Thank you for the quick attention. Is anything required of me? I see I was CCed on the related issue, but it looks like that was just for info.

@ajaysaini725
Contributor

An update making spark-sklearn compatible with sklearn version >= 0.18.1 has been merged.

@dsackin
Author

dsackin commented Jul 18, 2017

I'm just about to adopt this update. Can you mark a new release on GitHub and update the version on PyPI? I currently rely on pip for the installs in my environments, and I was hoping not to have to switch to a git install just for this package.

@gordontsai

@dsackin Did you end up doing the git install? I'm also running into version issues when installing through pip.

@dsackin
Author

dsackin commented Jul 19, 2017

No. I haven't updated yet. I was hoping they would push it to PyPI before I switched to a git install.

@gordontsai

Got it. Just an FYI: I ended up doing the git install, and it worked.

@emceemouli

emceemouli commented Aug 9, 2017

Can you please let us know when a new release will be marked and pushed to PyPI?

@emceemouli

@gordontsai @dsackin I am quite new to git installs. Can you tell me how to perform a git install while we wait for this to be pushed to PyPI?

@srowen
Collaborator

srowen commented Dec 10, 2018

@thunterdb this is more about what it might take to support 0.20. We have a related issue at #73 about attributes like best_params_ not being set, which looks like an easy fix, but the simple fix doesn't run. PR #74 might also contain some of the necessary changes. I haven't looked into that part yet.

@thunterdb
Contributor

I see, this is more than pointing to the right package. The 0.20 release is less than 2 months old, so let us focus on the 0.1x releases until there is a more general need for that. What are your thoughts?

@srowen
Collaborator

srowen commented Dec 10, 2018

Yeah, right now I'm certainly more concerned with a new release to fix some bugs, and maybe get random search in. If you have a sec to look at #73 you might know the quick answer; that might also be a quick fix relevant to 0.19.


6 participants