Add tools for testing and make an example with CPI #265
Conversation
Thx for opening this.
LGTM overall.
If you can split it into smaller PRs, it will make our life easier.
Co-authored-by: bthirion <[email protected]>
```python
assert (
    rho_noise_time >= 0.0 and rho_noise_time <= 1.0
), "rho_noise_time must be between 0 and 1"
assert snr >= 0.0, "snr must be non-negative"
```
Maybe we can simplify the code a little by imposing snr > 0 and removing the test at L273. An SNR of 0 is irrelevant: the entire function becomes useless, since there is no need to select coefficients.
Why do you think that snr = 0.0 is not relevant?
The snr controls only the noise. By setting it to zero, I want to consider the case where there is no additive noise.
This is the simplest and most natural case, but it is always good for testing.
snr=0 means that the signal is negligible compared to the noise. That is, you only have noise. IMO you don't need a function for that, you can just do X, y = np.random.randn(), ...
With snr=0, the function raises an error because there will be a division by zero.
I chose this value to disable the noise. Would you prefer snr=np.inf for disabling the noise?
Yes, it should be this way.
And if we want to allow passing snr=0, then we can test for it at the beginning and directly return noise instead of going through the whole function.
I kept the function because I want to preserve the effect of sigma and the correlation of the noise.
I think it is better now.
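The convention settled on above could be sketched like this. This is a standalone sketch with invented names: `add_noise`, its signature, and the SNR definition `||signal|| / ||noise||` are assumptions for illustration, not the PR's actual code.

```python
import numpy as np

def add_noise(signal, snr=np.inf, sigma_noise=1.0, rng=None):
    """Add Gaussian noise scaled to reach a target SNR.

    snr=np.inf disables the noise entirely (hypothetical helper,
    not the PR's actual function).
    """
    rng = np.random.default_rng(rng)
    if snr == np.inf:
        # no additive noise at all
        return signal
    if snr <= 0:
        raise ValueError("snr must be positive (use np.inf for no noise)")
    noise = sigma_noise * rng.standard_normal(signal.shape)
    # scale the noise so that ||signal|| / ||scaled noise|| == snr
    noise_mag = np.linalg.norm(signal) / (snr * np.linalg.norm(noise))
    return signal + noise_mag * noise
```

With this convention, `noise_mag` is a derived quantity, while `snr` stays the single user-facing knob.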
```python
assert imputation_model_continuous is None or issubclass(
    imputation_model_continuous.__class__, BaseEstimator
), "Continuous imputation model invalid"
```
Is this robust to different variations of imputation model, such as OneVsRestClassifier, MultiOutputClassifier, ...?
The check here only tests whether the model is an estimator.
It doesn't test whether it's a classifier or a regressor.
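For what it's worth, sklearn meta-estimators also inherit from `BaseEstimator`, so the check does accept them. A quick sketch (the model choices are just examples):

```python
from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier

models = [
    LinearRegression(),
    OneVsRestClassifier(LogisticRegression()),
    MultiOutputClassifier(LogisticRegression()),
]
for model in models:
    # isinstance(model, BaseEstimator) is the idiomatic equivalent of
    # issubclass(model.__class__, BaseEstimator)
    assert isinstance(model, BaseEstimator)
```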
```
    True coefficient vector/matrix with support_size non-zero elements.
non_zero : ndarray
    Indices of non-zero coefficients in beta_true.
noise_factor : float
```
I think we should consistently use snr in the API instead of noise_mag, noise_factor, ...
It may add one line of code, but it improves clarity and uniformity.
What do you mean?
noise_mag is the coefficient applied to the noise to obtain the right ratio, and snr is the signal-to-noise ratio.
For me, these two elements are quite different.
I think these two are redundant. You define noise_mag as a function of snr (L272/275).
noise_mag is a result of the function and snr is a parameter of the function.
I don't see the redundancy here.
The only redundancy is between sigma_noise and snr, which are two scaling parameters of the same noise.
But if you have noise_mag and the noisy signal, you can get snr; similarly, if you have snr and the noisy signal, you can get noise_mag.
My point is that having a consistent parameter controlling the noise level makes the code easier to understand and facilitates the API's use.
> if you have snr and the noisy signal, you can get noise_mag.

For me, it's not possible, because you don't have the signal amplitude.
noise_mag is not a controlling variable, it's a result.
When do we need noise_mag?
noise_mag is the noise magnitude.
It corresponds to the final variance of the noise.
Yes, but is it needed somewhere in the codebase? If not, I don't think it should be returned.
One more step. Thx !
examples/plot_dcrt_example.py
Outdated
```python
)
index_support = np.where(beta_true)[0]
```
Yes, but my point is: rather use beta_true directly in all subsequent calls. Experience shows that in 99% of the cases you don't need index_support, and moreover, the where API is a bit awkward.
(I used to use where a lot, but this is a bad pattern.)
examples/plot_dcrt_example.py
Outdated
```python
)
index_support = np.where(beta_true)[0]
index_no_support = np.where(beta_true - 1)[0]
```
We might soon revisit that and generate non-binary beta. For instance, it typically makes sense to use different values to give different sensitivities to the variables. One may also want to check that negative betas work as well as positive ones.
For this reason, we should assume that beta_true contains arbitrary scalars.
src/hidimstat/_utils/scenario.py
Outdated
```
    Signal-to-noise ratio. Controls noise scaling.
sigma_noise : float, default=1.0
    Standard deviation of the noise.
rho_noise_time : float, default=0.0
```
OK. Rename rho_noise_time to rho_serial maybe (or even rho_samples if we change our naming as discussed above).
```python
    " to set variable groups. If no grouping is needed,"
    " call fit with groups=None"
)
count = 0
for group_id in self.groups.values():
    if type(group_id[0]) is int:
```
I don't understand this check.
It's a light check that the number of features of X_test corresponds to the number of features of X_train:
I check that each feature id in the group corresponds to a column of X_test.
Shouldn't it be np.all?
Also, I would not name the variable group_id, since self.groups.values() actually contains variable ids; the group id would be self.groups.keys(), or the index of the key if keys contains strings.
Finally, if we want to test this for arrays, why not test it for dataframes, making sure that the group members are in the list of column names?
> Shouldn't it be np.all?

Yes, it should be np.all.
I changed the name to index_variables.
> Finally, if we want to test this for arrays, why not test it for dataframes, making sure that the group members are in the list of column names?

For dataframes, it's a bit more complicated, because getting the column names differs between numpy and pandas.
I will try to add it.
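A possible shape for such a check, covering both cases. This is a sketch under the assumption that `groups` maps names to lists of column indices (ndarray) or column names (DataFrame); `check_groups` is an invented helper, not the PR's code.

```python
import numpy as np
import pandas as pd

def check_groups(X, groups):
    """Light check that every group member refers to a column of X."""
    if isinstance(X, pd.DataFrame):
        columns = set(X.columns)
        for members in groups.values():
            # DataFrame: members must be existing column names
            assert np.all([m in columns for m in members]), (
                "X does not correspond to the fitting data."
            )
    else:
        for members in groups.values():
            # ndarray: members must be valid column indices
            assert np.all(np.asarray(members) < X.shape[1]), (
                "X does not correspond to the fitting data."
            )
```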
```python
), "X does not correspond to the fitting data."
count += len(group_id)
if X.shape[1] > count:
    warnings.warn("Not all features will have an importance score.")
```
This is weird. In which situation would that occur?
See the comment of @jpaillard before:

> In a scenario where you know that a variable is predictive, but you are not interested in measuring its importance.
> For instance, in the diabetes dataset, one may not be interested in the importance of age, but still want to include it in the predictive model to get the importance of other variables conditionally to age. For instance, BloodPressure, which is correlated with age.
We should consider whether we have the right API for that.
Measuring the importance of variables/groups that do not match the dimension of the data is a useful feature. For instance, to measure the importance of a few variables while controlling (conditioning) for others, for which we don't want to compute and report importance/p-values. Also, if we want to measure the importance of overlapping groups, for instance, in hierarchical clustering.
This feature is naturally supported by the current API, and I don't see any reason to prevent users from using it.
I would suggest removing the warning.
OK, but this should be documented and showcased in an example, because it is too easy to miss, and then people will get results that they don't understand.
The alternative is to consider that there are "features to be tested" and nuisance features to condition upon, for which no importance is reported.
examples/plot_dcrt_example.py
Outdated
```python
)
index_support = np.where(beta_true)[0]
```
So support = beta_true != 0 (or np.abs(beta_true) > eps).
```
@@ -84,6 +87,8 @@ def fit(self, X, y=None, groups=None):
    self._groups_ids = [
        np.array(ids, dtype=int) for ids in list(self.groups.values())
    ]
else:
    raise ValueError("groups needs to be a dictionary")
```
Do groups have the same semantics as in sklearn? If yes, we should use the same type. I believe it is an array.
Based on the definition of groups in the sklearn glossary, we don't use the same semantics.
The group in sklearn is only related to cross-validation; in our case, a group can be seen as a cluster of features. I didn't find any equivalent in the sklearn glossary.
OK, sklearn groups are over samples rather than features. Yet, what is the motivation for using a dictionary?
The motivation is to let the user give some meaning to a group by attaching a specific name to it, if they want.
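As an illustration of that motivation (the group names here are invented):

```python
# A dict lets the user attach meaningful names to groups of features;
# keys are user-chosen labels, values are the member column indices.
groups = {
    "demographics": [0, 1],
    "blood_markers": [2, 3, 4],
}

# An sklearn-style array could encode the same partition of 5 features,
# but would carry no names:
groups_array = [0, 0, 1, 1, 1]
```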
Co-authored-by: bthirion <[email protected]>
Following the discussion in issue #188, I merged the 3 functions for generating data.
Based on this new function, I created a set of tests for CPI. Once the framework for the tests is defined, it will be extended to all the methods.