Add tools for testing and make an example with CPI #265

lionelkusch · 2025-05-23T18:03:10Z

Following the discussion in issue #188, I merged the three data-generation functions.
Based on this new function, I created a set of tests for CPI. Once the test framework is settled, it will be extended to all the methods.

The method requires some reflection. The coefficient-based threshold used to estimate the weight seems strange, and the signal-to-noise ratio equation also looks odd.

bthirion

Thx for opening this.
LGTM overall.
If you can split it into smaller PRs, it will make our life easier.

src/hidimstat/_utils/scenario.py

src/hidimstat/base_perturbation.py

src/hidimstat/noise_std.py

test/_utils/test_scenario.py

test/conftest.py

Co-authored-by: bthirion <[email protected]>

jpaillard · 2025-06-06T07:37:12Z

src/hidimstat/_utils/scenario.py

+    assert (
+        rho_noise_time >= 0.0 and rho_noise_time <= 1.0
+    ), "rho_noise_time must be between 0 and 1"
+    assert snr >= 0.0, "snr must be positive"


Maybe we can simplify the code a little by imposing snr > 0 and removing the test at L273. An SNR of 0 is irrelevant: the entire function becomes useless, since there is no need to select coefficients.

Why do you think that snr = 0.0 is not relevant?
The snr only controls the noise. By setting it to zero, I want to cover the case where there is no additive noise.
This is the simplest, most naive case, but it is always good for testing.

snr=0 means that the signal is negligible compared to the noise. That is, you only have noise. IMO you don't need a function for that, you can just do X, y = np.random.randn(), ...

By default, snr=0 gives an error because there would be a division by zero.
I chose this value to disable the noise. Would you prefer snr=np.inf for disabling the noise?

Yes, it should be this way.

And if we want to allow passing snr=0, then we can test for it at the beginning and directly return noise instead of going through the whole function.

I kept the function because I want to keep the effect of sigma and correlation on the noise.
I think it should be better now.
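
To make the two conventions discussed above concrete, here is a minimal sketch, not the code of this PR. It assumes the SNR is defined as the ratio of the norm of the signal to the norm of the scaled noise; the function name `add_noise` and its return values are hypothetical. With that definition, `snr=np.inf` disables the noise and `snr=0` returns noise only, without any division by zero.

```python
import numpy as np


def add_noise(signal, snr, sigma_noise=1.0, seed=0):
    """Hypothetical helper: return (noisy_signal, noise_mag).

    Assumed convention: snr = ||signal|| / ||noise_mag * noise||.
    """
    if snr < 0:
        raise ValueError("snr must be non-negative")
    rng = np.random.default_rng(seed)
    noise = sigma_noise * rng.standard_normal(signal.shape)
    if np.isinf(snr):
        # Infinite SNR: the noise is disabled.
        return signal.copy(), 0.0
    if snr == 0:
        # Zero SNR: the signal is negligible, return the noise only.
        return noise, 1.0
    # Scale the noise so that the assumed SNR definition holds exactly.
    noise_mag = np.linalg.norm(signal) / (snr * np.linalg.norm(noise))
    return signal + noise_mag * noise, noise_mag


signal = np.arange(1.0, 11.0)
noisy, noise_mag = add_noise(signal, snr=2.0)
noiseless, _ = add_noise(signal, snr=np.inf)
only_noise, _ = add_noise(signal, snr=0.0)
```

Handling the two edge cases up front keeps the scaling formula free of special cases, which is the simplification suggested in the thread.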

src/hidimstat/_utils/scenario.py

src/hidimstat/base_perturbation.py

jpaillard · 2025-06-06T07:56:48Z

src/hidimstat/conditional_permutation_importance.py

+        assert imputation_model_continuous is None or issubclass(
+            imputation_model_continuous.__class__, BaseEstimator
+        ), "Continous imputation model invalid"


Is this robust to different variations of imputation model such as OneVsRestClassifier, MultiOutputClassifier ... ?

The test here is for testing if the model is an estimator.
It doesn't test if it's a classifier or a regressor.
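
As an illustration of the question above, the following sketch checks a few scikit-learn meta-estimators. It relies only on public scikit-learn helpers (`BaseEstimator`, `is_classifier`, `is_regressor`); the dictionary names are illustrative. Note that `issubclass(model.__class__, BaseEstimator)` is equivalent to `isinstance(model, BaseEstimator)`.

```python
from sklearn.base import BaseEstimator, is_classifier, is_regressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier

# A few candidate imputation models, including meta-estimators.
candidates = {
    "ovr": OneVsRestClassifier(LogisticRegression()),
    "multi_output": MultiOutputClassifier(LogisticRegression()),
    "linear": LinearRegression(),
}

# Same test as the assertion in the diff, written with isinstance.
is_estimator = {
    name: isinstance(model, BaseEstimator) for name, model in candidates.items()
}

# Distinguishing classifiers from regressors needs a separate check.
kinds = {}
for name, model in candidates.items():
    if is_classifier(model):
        kinds[name] = "classifier"
    elif is_regressor(model):
        kinds[name] = "regressor"
    else:
        kinds[name] = "other"
```

The wrappers pass the estimator check because they inherit from `BaseEstimator`, but the check alone says nothing about whether the model suits a continuous or a categorical variable.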

jpaillard · 2025-06-06T08:00:39Z

src/hidimstat/_utils/scenario.py

+        True coefficient vector/matrix with support_size non-zero elements.
+    non_zero : ndarray
+        Indices of non-zero coefficients in beta_true.
+    noise_factor : float


I think we should consistently use snr in the API instead of noise_mag, noise_factor ...
It may add one line of code, but it improves clarity and uniformity.

What do you mean?
noise_mag is the coefficient applied to the noise to obtain the right ratio, and snr is the signal-to-noise ratio.
For me, these two elements are quite different.

I think these two are redundant. You define noise_mag as a function of the snr (L272/275).

noise_mag is a result of the function and snr is a parameter of the function.
I don't see the redundancy here.
The only redundancy is sigma_noise and snr which are 2 scaling parameters of the same noise.

But if you have the noise_mag and the noisy signal, you can get the snr; similarly, if you have snr, noisy signal, you can get the noise_mag.
My point is that having a consistent parameter controlling the noise level makes the code easier to understand and facilitates the API's use.

if you have snr, noisy signal, you can get the noise_mag.

For me, it's not possible because you don't have the signal amplitude.
noise_mag is not a controlling variable, it's a result.

When do we need noise_mag?

noise_mag is the noise magnitude.
It corresponds to the final variance of the noise.

Yes, but is it needed somewhere in the codebase? If not, I don't think it should be returned.

bthirion

One more step. Thx !

bthirion · 2025-06-17T20:21:52Z

examples/plot_dcrt_example.py

    )
+    index_support = np.where(beta_true)[0]


Yes, but my point is: rather use beta_true directly in all subsequent calls. Experience shows that in 99% of cases you don't need index_support, and moreover, the where API is a bit awkward.
(I used to use where a lot, but it is a bad pattern.)

bthirion · 2025-06-17T20:24:10Z

examples/plot_dcrt_example.py

    )
+    index_support = np.where(beta_true)[0]
+    index_no_support = np.where(beta_true - 1)[0]


We might soon revisit that and generate non-binary beta. For instance, it typically makes sense to use different values to get different sensitivities for each variable. One may also want to check that negative beta works as well as positive beta.
For this reason, we should assume that the entries of beta_true are arbitrary scalars.
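
A small sketch of why this matters, using a made-up non-binary `beta_true`: the `np.where(beta_true - 1)` pattern from the diff only identifies the non-support when beta is binary, whereas a boolean mask works for any values.

```python
import numpy as np

# Illustrative non-binary coefficients, including a negative one.
beta_true = np.array([0.0, 2.5, -1.0, 0.0, 1.0])

# Boolean mask of the support; valid for arbitrary scalar coefficients.
support = beta_true != 0
no_support = ~support

# Pattern from the diff: it flags every entry different from 1,
# so it is only correct when beta_true contains 0s and 1s.
index_no_support_binary = np.where(beta_true - 1)[0]

# Indices recovered from the mask, if integer indices are really needed.
index_no_support = np.flatnonzero(no_support)
```

Here the mask gives indices 0 and 3, while the binary-only pattern also includes indices 1 and 2, which belong to the support.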

src/hidimstat/_utils/scenario.py

bthirion · 2025-06-17T20:34:24Z

src/hidimstat/_utils/scenario.py

+        Signal-to-noise ratio. Controls noise scaling.
+    sigma_noise : float, default=1.0
+        Standard deviation of the noise.
+    rho_noise_time : float, default=0.0


OK. Rename rho_noise_time to rho_serial maybe (or even rho_samples if we change our naming as discussed above).

bthirion · 2025-06-17T20:42:27Z

src/hidimstat/base_perturbation.py

                " to set variable groups. If no grouping is needed,"
                " call fit with groups=None"
            )
+        count = 0
+        for group_id in self.groups.values():
+            if type(group_id[0]) is int:


I don't understand this check.

It's a lightweight check that the number of features of X_test matches the number of features of X_train.
I check that each feature id in the group corresponds to a column of X_test.

Shouldn't it be np.all?
Also, I would not name the variable group_id since self.groups.values() actually contains variable ids; the group id would be self.groups.keys(), or the index of the key if the keys are strings.

Finally, if we want to test this for arrays, why not test it for dataframes, making sure that the group members are in the list of column names?

Shouldn't it be np.all?
Yes, it should be np.all.

I changed the name to index_variables.

Finally, if we want to test this for arrays, why not test it for dataframes, making sure that the group members are in the list of column names?

For dataframes, it's a bit more complicated because getting the column names differs between numpy and pandas.
I will try to add it.
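
One possible shape for such a check, as a sketch only: the function name `check_groups_match` is hypothetical, and the DataFrame case is handled by duck-typing on a `columns` attribute rather than by importing pandas. It assumes integer members are positional indices and any other member is a column name.

```python
import numpy as np


def check_groups_match(X, groups):
    """Hypothetical check: every group member must exist in X.

    Returns the number of distinct features covered by the groups.
    """
    columns = getattr(X, "columns", None)  # present on DataFrame-like inputs
    n_features = X.shape[1]
    covered = set()
    for name, index_variables in groups.items():
        for variable in index_variables:
            if isinstance(variable, (int, np.integer)):
                # Positional index: must point to an existing column.
                valid = 0 <= variable < n_features
            else:
                # Column name: only meaningful for DataFrame-like inputs.
                valid = columns is not None and variable in list(columns)
            if not valid:
                raise ValueError(
                    f"Group {name!r}: variable {variable!r} not found in X."
                )
            covered.add(variable)
    return len(covered)


X = np.zeros((5, 4))
n_covered = check_groups_match(X, {"a": [0, 1], "b": [2]})
```

The function returns the coverage instead of warning, so the caller can decide what to do when some features belong to no group.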

bthirion · 2025-06-17T20:43:00Z

src/hidimstat/base_perturbation.py

+                ), "X does not correspond to the fitting data."
+            count += len(group_id)
+        if X.shape[1] > count:
+            warnings.warn("Not all features will has a importance score.")


This is weird. In which situation would that occur?

See the comment of @jpaillard before:

In a scenario where you know that a variable is predictive, but you are not interested in measuring its importance.
For instance, in the diabetes dataset, one may not be interested in the importance of age, but still want to include it in the predictive model to get the importance of other variables conditionally to age. For instance, BloodPressure which is correlated with age.

We should consider whether we have the right API for that.

Measuring the importance of variables/groups that do not match the dimension of the data is a useful feature. For instance, to measure the importance of a few variables while controlling (conditioning) for others, for which we don't want to compute and report importance/p-values. Also, if we want to measure the importance of overlapping groups, for instance, in hierarchical clustering.

This feature is naturally supported by the current API, and I don't see any reason to prevent users from using it.
I would suggest removing the warning.

OK, but this should be documented and showcased in an example, because it is too easy to miss it, and then people will get results that they don't understand.
The alternative is to consider that there are "features to be tested" and nuisance features to condition upon but that are not to be considered.
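
To illustrate the distinction between features to be tested and nuisance features, here is a plain-Python sketch. The feature and group names are made up for the example and no hidimstat call is involved.

```python
# Illustrative feature names, loosely inspired by the diabetes example.
feature_names = ["age", "bmi", "bp", "s1", "s2"]

# Groups may cover only part of the features and may overlap ("bp" is shared).
groups = {
    "blood_pressure": ["bp"],
    "serum": ["s1", "s2"],
    "body": ["bmi", "bp"],
}

# Features whose importance is measured (members of at least one group).
tested = {feature for members in groups.values() for feature in members}

# Features kept in the model for conditioning, but not tested.
nuisance = [feature for feature in feature_names if feature not in tested]
```

In this example `age` is the only nuisance feature: it stays in the predictive model, but no importance is reported for it.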

src/hidimstat/noise_std.py

bthirion · 2025-06-17T20:48:39Z

examples/plot_dcrt_example.py

    )
+    index_support = np.where(beta_true)[0]


so support = beta_true != 0 (or np.abs(beta_true) > eps)

bthirion · 2025-06-17T20:49:55Z

src/hidimstat/base_perturbation.py

@@ -84,6 +87,8 @@ def fit(self, X, y=None, groups=None):
                self._groups_ids = [
                    np.array(ids, dtype=int) for ids in list(self.groups.values())
                ]
+        else:
+            raise ValueError("groups needs to be a dictionnary")


Do groups have the same semantics as in sklearn? If yes, we should use the same type. I believe it is an array.

Based on the definition of groups in the sklearn glossary, we don't use the same semantics.
Groups in sklearn are only related to cross-validation. In our case, a group can be seen as a cluster of features. I didn't find any equivalent in the sklearn glossary.

OK, sklearn groups are samples rather than features. Yet, what is the motivation for having a dictionary?

The motivation is to let users give some meaning to a group by assigning it a specific name if they want.
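
A sketch of the dictionary convention, for illustration. The helper `normalize_groups` is hypothetical, and the behaviour for `groups=None` (one group per feature) is an assumption; only the list comprehension building the index arrays and the error for non-dictionary input mirror the diff above.

```python
import numpy as np


def normalize_groups(groups, n_features):
    """Hypothetical helper returning (groups, groups_ids)."""
    if groups is None:
        # Assumption: without grouping, each feature is its own group.
        groups = {j: [j] for j in range(n_features)}
    elif not isinstance(groups, dict):
        raise ValueError("groups needs to be a dictionary")
    # Same construction as in the diff: one integer array per group.
    groups_ids = [np.array(ids, dtype=int) for ids in groups.values()]
    return groups, groups_ids


named, named_ids = normalize_groups({"demographics": [0, 1], "labs": [2, 3, 4]}, 5)
default, default_ids = normalize_groups(None, 3)
```

The keys carry the user-facing names, while the index arrays are what the perturbation code iterates over.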

Co-authored-by: bthirion <[email protected]>

lionelkusch added 13 commits May 23, 2025 14:50

remove not necessary function for testing

2524418

add a check on the number of featires for X

3688c9c

add assertion on the model

ee9cf84

improve the check_fit

be4170c

update data_generation

2f2779c

add function for the generation of data for the tests

3a228b2

improve test for CPI

5348e89

fix tests

bf7383f

improve geenration of model

6aa3f41

change multivariate function but need to fix the function reid

dfcc77c

change the generation of data

7fc28dc

Add TODO

74d8462

It requires some reflexion about the method. The threshold based on the coefficient for estimating the weight seems strange. The equation signal to ratio seems also weird

add tests for senario

0ce7094

lionelkusch requested review from bthirion and jpaillard May 23, 2025 18:03

lionelkusch added the test Question link to tests label May 23, 2025

bthirion reviewed May 25, 2025

lionelkusch commented May 26, 2025

src/hidimstat/noise_std.py (outdated, resolved)

src/hidimstat/noise_std.py (outdated, resolved)

src/hidimstat/_utils/scenario.py (outdated, resolved)

test/_utils/test_scenario.py (resolved)

lionelkusch and others added 2 commits May 26, 2025 10:45

fix tests

297e26d

Update src/hidimstat/_utils/scenario.py

b105d5d

Co-authored-by: bthirion <[email protected]>

lionelkusch mentioned this pull request May 26, 2025

An issue with the estimation of the support in reid function. #266

Open

Fix docstring in the PR

03d41fb

lionelkusch mentioned this pull request May 26, 2025

How to estimate signal-to-noise ratio? #267

Open

jpaillard reviewed Jun 6, 2025

lionelkusch added 6 commits June 6, 2025 13:47

fix noise_mag

463d35b

improve test of knockoff

055bede

small improvement

d0df863

fix estimation of variance

062817c

clean noise_std

a6dbabd

change name of the sigma

59ea869

lionelkusch added 15 commits June 16, 2025 11:05

change mane of the noise fonction

0e4dfc5

Change parameter continous

695626f

fix range of paraemter for rho_noise_time

847e598

fix message of the error

db8d671

fix message of error

27ac293

fix test

9c3ee05

fix error in name parameters

04aed38

fix name in the error

f6c0e99

fix tests

bb5a236

fix generation of data parameters

64ff9ad

fix the missmathc of feature

cd73f6f

fix error when the support of noise was zero

319d02c

fix bug in the assertion

eee2700

fix assertion

624c040

Merge branch 'main' into PR_test_cpi

0f43f07

lionelkusch requested a review from bthirion June 17, 2025 09:11

bthirion reviewed Jun 17, 2025

lionelkusch and others added 13 commits June 18, 2025 14:54

Update src/hidimstat/_utils/scenario.py

a3d6e9a

Co-authored-by: bthirion <[email protected]>

transform beta in boolean array

476ea57

rename noise serial

3219c33

Improve comment of shuffle

2cd2ecf

done

2bb005e

remove assertion on number of jobs

1a2039b

Improve docstring

1bbb3b8

fix bug in tests

9c16ab6

fix tests cpi

fb18bd0

Remove unessesary tests

dae8f8d

fix test noise_std

158a293

replace n_times by n_targets

2b39b65

fix change name

8006811
