dcrt0(1/4): API2: add comments and docstring of the functions #220

base: main

Conversation
Please put the docstring handling in a _utils module.
We don't see anything in the figures: replace the boxplots with bars + error bars.
fdr to fpr
Co-authored-by: bthirion <[email protected]>
src/hidimstat/_utils/docstring.py (outdated)
Why do we need this? The docstrings were rendered properly before, no?
This file contains the functions that create the docstring for a function based on the docstring of the class.
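For illustration, a minimal sketch of what such a helper could look like; the name and behavior here are hypothetical, not the actual hidimstat code:

```python
# Hypothetical sketch of a docstring-reuse helper, not the actual module.
def _attach_class_docstring(cls):
    """Decorator appending the docstring of `cls` to a function's docstring."""
    def decorator(func):
        func.__doc__ = (func.__doc__ or "") + "\n" + (cls.__doc__ or "")
        return func
    return decorator
```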
    joblib_verbose=0,
    fit_y=False,
    n_tree=100,
    problem_type="regression",
This comment is not specific to the class implementation. I am not sure this parameter does what we expect it to do. It is only passed to rf_distillation. For regression and classification, _fit_lasso is called (L206) using a regression model. Using Lasso for classification doesn't break, but we would probably rather use Logistic Regression with L1 regularization instead.
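For reference, a hedged sketch of the suggested alternative; the sklearn estimators are real, but wiring them in by problem type is a hypothetical illustration:

```python
from sklearn.linear_model import Lasso, LogisticRegression

def _make_l1_model(problem_type):
    # Hypothetical helper: pick an L1-penalized model per problem type.
    if problem_type == "classification":
        # L1-regularized logistic regression instead of Lasso
        return LogisticRegression(penalty="l1", solver="liblinear")
    return Lasso()
```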
See issue #253 for the discussion.
Thx. One important question concerns API homogeneity.
def importance(
    self,
    fpr=0.05,
    scaled_statistics=False,
importance should have a uniform API across classes. I think that it should only accept X=None, y=None as arguments. fpr is only useful for selection, thus not for importance computation. scaled_statistics could be specified when creating the class. WDYT?
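A sketch of what this uniform API could look like (hypothetical signatures, just to make the proposal concrete):

```python
class D0CRT:
    def __init__(self, scaled_statistics=False):
        # scaled_statistics is fixed at construction time
        self.scaled_statistics = scaled_statistics

    def importance(self, X=None, y=None):
        # X/y default to the data seen in fit; no fpr argument here
        ...

    def selection(self, fpr=0.05):
        # fpr only matters when selecting variables
        ...
```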
During my first refactoring, importance returned an importance value and a selection; in that case, fpr is required for importance. Following the discussion in issue #217, this API is not yet formalised. Once there is more consensus on it, I will update this PR.
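For context, a hypothetical sketch of that first-refactoring behavior, where the selection step is folded into importance and therefore needs fpr (`_compute_pvalues` is an illustrative helper, not actual code):

```python
def importance(self, fpr=0.05, scaled_statistics=False):
    # Hypothetical: return both a selection and p-values, hence fpr.
    pvals = self._compute_pvalues(scaled_statistics)  # illustrative helper
    selection = pvals < fpr
    return selection, pvals
```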
        )
        for idx in selection_set
    )
elif self.statistic == "random_forest":
I think that we will have to revise the use of this so-called self.statistic.
One way of doing it is to let the user choose their own estimator and the associated loss.
Indeed
Do we want this?
Let's leave that for another PR, but indeed, the estimator should be provided by the user (with a good default); I personally think RFs are a good default.
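To make that concrete, a hypothetical sketch of a user-supplied estimator and loss; the estimator/loss parameters are assumptions about a future API, not the current D0CRT signature:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical future API: estimator and loss chosen by the user,
# with a random forest as a sensible default.
d0crt = D0CRT(
    estimator=RandomForestRegressor(n_estimators=100),  # assumed parameter
    loss=mean_squared_error,                            # assumed parameter
)
```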
Moving forward, thx!
    selection_features, X_res, sigma2, y_res
)
d0crt_lasso = D0CRT(screening=False, statistic="residual")
d0crt_lasso.fit(X, y)
Suggested change:
- d0crt_lasso.fit(X, y)
+ d0crt_lasso = D0CRT(screening=False, statistic="residual").fit(X, y)
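(Note: this chaining assumes fit returns self, as sklearn estimators do.)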
)

d0crt = D0CRT(problem_type="classification", screening=False)
d0crt.fit(X, y)
Suggested change:
- d0crt.fit(X, y)
+ d0crt = D0CRT(problem_type="classification", screening=False).fit(X, y)
d0crt = D0CRT(problem_type="classification", screening=False)
d0crt.fit(X, y)
_, pval_dcrt = d0crt.importance(fpr=0.05)
I think it was already mentioned in another PR, but importance should not take an fpr argument.
No, this was mentioned in issue #217. However, there is no consensus on whether this function takes an X_test and y_test or not.
All importance methods should have X=None, y=None as arguments, which would default to the already provided training data. In some cases, these arguments would not even be taken into account.
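A minimal sketch of that defaulting pattern (the attribute names are hypothetical):

```python
def importance(self, X=None, y=None):
    if X is None:
        X = self.X_train_  # hypothetical: fall back to the data seen in fit
    if y is None:
        y = self.y_train_
    # ... compute importance on (X, y)
```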
Whether all the importance methods will have an X and y is not yet decided. Using an X_test and y_test will change the algorithm a bit, since the residuals would be computed on this test data. As a user, it's more readable to add additional optional parameters to a function than to have parameters which don't have any effect.
The point is about API uniformity: you want to be able to loop over methods with maximally similar arguments without triggering an error. The mechanism is fundamental in sklearn and related libraries.
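A small sketch of the looping pattern this enables (the method instances are hypothetical):

```python
# With uniform signatures, benchmarking code needs no special cases.
for method in [d0crt, loco, cpi]:  # hypothetical method instances
    method.fit(X, y)
    scores = method.importance()   # same call works for every method
```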
Can you point me to one function in sklearn where some parameters are not used by the function, due to the homogenisation of the API?
    Threshold for variable screening (0-100)
    Whether to perform variable screening step based on Lasso coefficients
screening_threshold : float, default=10
    Percentile threshold for screening (0-100), larger values lead to the inclusion of more variables at the screening stage (screening_threshold=100 keeps all variables).
This line is too long.
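For example, wrapped to a conventional width:

```
screening_threshold : float, default=10
    Percentile threshold for screening (0-100). Larger values lead to
    the inclusion of more variables at the screening stage
    (screening_threshold=100 keeps all variables).
```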
        )
        for idx in selection_set
    )
elif self.statistic == "random_forest":
Indeed
I cannot really move forward without knowing where I am going.
All unsupervised learning class.fit() methods (clustering, PCA, ICA, manifold learning, ...) in sklearn have an unused y=None argument.
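A concrete example of that pattern, using real sklearn estimators:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 3.0]]

# Unsupervised fit(X, y=None) accepts y purely for API uniformity;
# the argument is ignored.
KMeans(n_clusters=2, n_init=10).fit(X, y=None)
PCA(n_components=1).fit(X, y=None)
```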
Force-pushed from e5d5482 to 7a24bda.
Please LMK if you need a review on that one.
Not yet; I will just update the PR once the API is finalised.
Following the Sprint, a new API has been characterized. This is an example of the first methods refactored for this new API.