dcrt0(1/4): API2: add comments and docstring of the functions #220


Open — wants to merge 107 commits into main

Conversation

lionelkusch (Collaborator):

Following the sprint, a new API has been defined.
This PR is a first example of refactoring methods to this new API.

@bthirion (Collaborator) left a comment:

Please put the docstring handling in a _utils module.

We don't see anything in the figures: replace the boxplots with bars plus error bars.

@lionelkusch added the "API 2: Refactoring following the second version of the API" label on Apr 11, 2025
Collaborator:

Why do we need this? The docstrings were rendered properly before, no?

Collaborator Author:

This file contains the functions that create a function's docstring from the docstring of the corresponding class.

joblib_verbose=0,
fit_y=False,
n_tree=100,
problem_type="regression",
Collaborator:

This comment is not specific to the class implementation.
I am not sure this parameter does what we expect it to do: it is only passed to rf_distillation.
For both regression and classification, _fit_lasso is called (L206) using a regression model. Using Lasso for classification doesn't break, but we would probably rather use logistic regression with L1 regularization instead.
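To make the suggestion concrete, here is a minimal sketch (not the PR's code; the data and parameters are illustrative) of the alternative being proposed: for classification, an L1-penalized logistic regression plays the role that Lasso plays for regression.

```python
import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y_reg = X[:, 0] + 0.1 * rng.normal(size=100)
y_clf = (y_reg > 0).astype(int)

# Regression: Lasso, i.e. L1-penalized least squares.
reg = Lasso(alpha=0.1).fit(X, y_reg)

# Classification: the sparse analogue is logistic regression with an
# L1 penalty (requires a solver that supports it, e.g. liblinear).
clf = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y_clf)
```

Both models expose sparse coefficients, so downstream code that inspects `coef_` works the same way for either problem type.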

Collaborator Author:

See issue #253 for the discussion.

@bthirion (Collaborator) left a comment:

Thx. One important question concerns API homogeneity.

def importance(
self,
fpr=0.05,
scaled_statistics=False,
Collaborator:

importance should have a uniform API across classes. I think it should only accept X=None, y=None as arguments.
fpr is only useful for selection, thus not for importance computation.
scaled_statistics could be specified when creating the class.
WDYT?
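For illustration only (the class name and placeholder scores are hypothetical, not from the PR), the proposed uniform signature could look like this: fpr and scaled_statistics are fixed at construction time, and importance() takes only optional data that defaults to what was seen during fit.

```python
# Hypothetical sketch of the proposed uniform importance() API.
class SomeMethod:
    def __init__(self, scaled_statistics=False):
        self.scaled_statistics = scaled_statistics

    def fit(self, X, y):
        # Keep the training data so importance() can fall back on it.
        self.X_, self.y_ = X, y
        return self

    def importance(self, X=None, y=None):
        X = self.X_ if X is None else X
        y = self.y_ if y is None else y
        # Placeholder: a real method would compute per-feature scores.
        return [0.0] * len(X[0])
```

With this shape, calling `importance()` with no arguments scores the training data, while passing held-out data scores that instead — no fpr needed at this stage.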

Collaborator Author:

During my first refactoring, importance returned an importance value and a selection; in that case, fpr is required for importance.
From the discussion in issue #217, this API is not yet formalised.
Once there is more consensus on it, I will update this PR.

)
for idx in selection_set
)
elif self.statistic == "random_forest":
Collaborator:

I think that we will have to revise the use of this so-called self.statistic.

Collaborator Author:

One way of doing it is to let the user choose their own estimator and the associated loss.
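A minimal sketch of that idea (the function name and data are hypothetical, not from the PR): instead of a statistic="random_forest" string, the user passes any sklearn-style estimator together with the matching loss.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def compute_statistic(estimator, loss, X_train, y_train, X_test, y_test):
    # Fit the user-supplied estimator and score it with the matching loss.
    estimator.fit(X_train, y_train)
    return loss(y_test, estimator.predict(X_test))

X = np.arange(40, dtype=float).reshape(20, 2)
y = X[:, 0]
# A random forest as the default estimator, per the discussion below.
rf = RandomForestRegressor(n_estimators=10, random_state=0)
stat = compute_statistic(rf, mean_squared_error, X[:15], y[:15], X[15:], y[15:])
```

Any other estimator/loss pair (e.g. a classifier with log loss) would plug into the same call without touching the class internals.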

Collaborator:

Indeed

Collaborator Author:

Do we want this?

Collaborator:

Let's leave that for another PR, but indeed, the estimator should be provided by the user (with a good default); I personally think RFs are a good default.

@bthirion (Collaborator) left a comment:

Moving forward, thx !

selection_features, X_res, sigma2, y_res
)
d0crt_lasso = D0CRT(screening=False, statistic="residual")
d0crt_lasso.fit(X, y)
Collaborator:

Suggested change:
- d0crt_lasso.fit(X, y)
+ d0crt_lasso = D0CRT(screening=False, statistic="residual").fit(X, y)

)

d0crt = D0CRT(problem_type="classification", screening=False)
d0crt.fit(X, y)
Collaborator:

Suggested change:
- d0crt.fit(X, y)
+ d0crt = D0CRT(problem_type="classification", screening=False).fit(X, y)


d0crt = D0CRT(problem_type="classification", screening=False)
d0crt.fit(X, y)
_, pval_dcrt = d0crt.importance(fpr=0.05)
Collaborator:

I think it is already mentioned in another PR, but importance should not take an fpr argument.

Collaborator Author:

No, this was mentioned in issue #217.
However, there is no consensus on whether this function should take X_test and y_test or not.

Collaborator:

All importance methods should have X=None, y=None as arguments, defaulting to the already provided training data. In some cases, these arguments would not even be taken into account.

Collaborator Author:

Whether all the importance methods will have an X and y is not yet decided.
Using an X_test and y_test will change the algorithm a bit, since the residuals would be computed on this test data.

As a user, it's more readable to add optional parameters to a function than to have parameters which don't have any effect.

Collaborator:

The point is about API uniformity: you want to be able to loop over methods with maximally similar arguments without triggering an error. This mechanism is fundamental in sklearn and related libraries.
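The looping argument can be illustrated with two hypothetical classes (not from the PR): one uses the held-out data, the other accepts it purely for uniformity.

```python
class MethodA:
    def importance(self, X=None, y=None):
        # Uses held-out data when provided.
        return [float(len(X))] * 2 if X is not None else [1.0, 2.0]

class MethodB:
    def importance(self, X=None, y=None):
        # Accepts X and y for uniformity but ignores them.
        return [0.5, 0.5]

X_test, y_test = [[0, 1], [2, 3]], [0, 1]
# Identical signatures mean the caller needs no special-casing.
results = [m.importance(X=X_test, y=y_test) for m in (MethodA(), MethodB())]
```

If MethodB rejected the arguments instead of ignoring them, the loop would raise a TypeError for that method only, which is exactly the kind of benchmark-breaking asymmetry being argued against.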

Collaborator Author:

Can you point me to one function in sklearn where some parameters are not used by the function because of the homogenisation of the API?

Threshold for variable screening (0-100)
Whether to perform variable screening step based on Lasso coefficients
screening_threshold : float, default=10
Percentile threshold for screening (0-100), larger values lead to the inclusion of more variables at the screening stage (screening_threshold=100 keeps all variables).
Collaborator:

This line is too long.

)
for idx in selection_set
)
elif self.statistic == "random_forest":
Collaborator:

Indeed

@lionelkusch (Collaborator Author):

> Moving forward, thx !

I cannot really move forward without knowing where I am going.
The API is still under discussion in issue #217. Once the discussion converges, I will update it.

@bthirion (Collaborator):

All unsupervised learning fit() methods (clustering, PCA, ICA, manifold learning, ...) in sklearn have an unused y parameter. But please ask the sklearn developers; they will give you better hints than me.
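This can be checked directly: sklearn's unsupervised estimators accept y only so that every estimator shares the fit(X, y) interface (e.g. inside pipelines), and silently ignore it.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = np.random.RandomState(0).normal(size=(30, 3))

# The y argument of unsupervised fit() exists only for interface
# uniformity; passing y=None (or omitting it) changes nothing.
pca = PCA(n_components=2).fit(X, y=None)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X, y=None)
```

Both calls behave identically to fit(X); the unused parameter is the precedent being cited for the importance(X=None, y=None) proposal.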

@bthirion (Collaborator):

Please LMK if you need a review on that one.

@lionelkusch (Collaborator Author):

Not yet; I will update the PR once the API is finalised.

Labels
API 2: Refactoring following the second version of the API
4 participants