Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create rule S6970: The Scikit-learn \"fit\" method should be called before methods yielding results #3869

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

github-actions[bot]
Copy link
Contributor

You can preview this rule here (updated a few minutes after each push).

Review

A dedicated reviewer checked the rule description successfully for:

  • logical errors and incorrect information
  • information gaps and missing content
  • text style and tone
  • PR summary and labels follow the guidelines

@joke1196 joke1196 changed the title Create rule S6970 Create rule S6970: The Scikit-learn \"fit\" method should be called before methods yielding results Apr 15, 2024
@joke1196 joke1196 marked this pull request as ready for review April 15, 2024 11:59

== Why is this an issue?

When using the Scikit-learn library it is crucial to train the model before
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe replace "model" by "estimator" or "transformer", or both

attempting to get results. Failing to do so can lead to incorrect results or runtime errors.
The training is done with the help of the `fit` method and retrieving results can be done for example with the `predict` method.

If the `predict` method is called without a prior call to the `fit` method, a `NotFittedError` will be thrown.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe specify it's an exception ?

Comment on lines 18 to 22
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)

results = clf.cv_results_ # raises an AttributeError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have a more minimalistic exemple like :

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(1)

knn.n_samples_fit_ # Raises an AttributeError

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes and it makes me notice that I did not put the list of attributes that would raise an error.

results = clf.cv_results_ # raises an AttributeError
----

In the example above failing to train the model on the iris dataset with the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add , after "above"


== How to fix it

To fix the issue train the model by using the `fit` method.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add ","

@@ -75,16 +75,16 @@ kmeans.predict(X) # Compliant
== Resources
=== Documentation

* Scikit-learn Documentation - https://scikit-learn.org/stable/glossary.html#term-fit[term fit reference]
* Scikit-learn Documentation - https://scikit-learn.org/stable/glossary.html#term-fit[Glossary fit reference]
* Scikit-learn Documentation - https://scikit-learn.org/stable/modules/generated/sklearn.exceptions.NotFittedError.html#sklearn.exceptions.NotFittedError[NotFittedError reference]

ifdef::env-github,rspecator-view[]

Implementation details:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we check if the object is deserialized through pickle/joblib/... ? Or are we going to lose the type anyways ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we would be able to get the proper type back if it is deserialized that's true. I don't think we will be able to raise the error correctly in that case.

iris = datasets.load_iris()
X = iris.data

kmeans = KMeans(n_clusters=3, random_state=42)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should have a correct random_state

Copy link
Contributor

@joke1196 joke1196 Apr 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what sense this is not correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, it's my bad. I got confused with the numpy.random.Generator

"sqKey": "S6970",
"scope": "All",
"defaultQualityProfiles": ["Sonar way"],
"quickfix": "unknown",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: maybe set the quick fix to infeasible ?

Copy link

Quality Gate passed Quality Gate passed for 'rspec-frontend'

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

Copy link

Quality Gate passed Quality Gate passed for 'rspec-tools'

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants