Skip to content

Add tools for testing and make an example with CPI #265

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 78 commits into
base: main
Choose a base branch
from

Conversation

lionelkusch
Copy link
Collaborator

Following the discussion in issue #188, I merge the 3 functions for generating data.
Based on this new function, I created a set of tests for CPI. Once the framework for the test is defined, this will be extended to all the methods.

@lionelkusch lionelkusch requested review from bthirion and jpaillard May 23, 2025 18:03
@lionelkusch lionelkusch added the test Question link to tests label May 23, 2025
Copy link
Collaborator

@bthirion bthirion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for opening this.
LGTM overall.
If you can split it into smaller PRs, it will make our life easier.

assert (
rho_noise_time >= 0.0 and rho_noise_time <= 1.0
), "rho_noise_time must be between 0 and 1"
assert snr >= 0.0, "snr must be positive"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can simplify the code a little by imposing snr>0 and removing the test L273. SNR of 0 is irrelevant as the entire function becomes useless since there is no need to select coefficients

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you think that snr = 0.0 is not relevant?
The snr controls only the noise. By pointing to zeros, I want to consider the case where there is no additive noise.
This is the simplest and most native cases but it is always good for testing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

snr=0 means that the signal is negligible compared to the noise. That is, you only have noise. IMO you don't need a function for that, you can just do X, y = np.random.randn(), ...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default, snr=0, gives an error because it's there will be a division by zero.
I chose this value to disable the noise. Do you prefer to put snr=np.inf for disabling the noise?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should be this way.

And if we want to allow passing snr=0, then we can test for it at the beginning and directly return noise instead of going through the whole function.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I keep the function because I want to keep the effect of sigma and correlation for the noise.
I think now it should be better.

Comment on lines +72 to +74
assert imputation_model_continuous is None or issubclass(
imputation_model_continuous.__class__, BaseEstimator
), "Continous imputation model invalid"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this robust to different variations of imputation model such as OneVsRestClassifier, MultiOutputClassifier ... ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test here is for testing if the model is an estimator.
It doesn't test if it's a classifier or a regressor.

True coefficient vector/matrix with support_size non-zero elements.
non_zero : ndarray
Indices of non-zero coefficients in beta_true.
noise_factor : float
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should consistently use snr in the API instead of noise_mag, noise_factor ...
It may add one line of code, but it improves clarity and uniformity.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you mean?
The noise_mag is the coefficient that adds it to the noise to have the right ratio and snr is the signal noise ratio.
For me, these two elements are quite different.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these two are redundant. You define the noise_mag as a function of the snr ( L272/275).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noise_mag is a result of the function and snr is a parameter of the function.
I don't see the redundancy here.
The only redundancy is sigma_noise and snr which are 2 scaling parameters of the same noise.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if you have the noise_mag and the noisy signal, you can get the snr; similarly, if you have snr, noisy signal, you can get the noise_mag.
My point is that having a consistent parameter controlling the noise level makes the code easier to understand and facilitates the API's use.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you have snr, noisy signal, you can get the noise_mag.

For me, it's not possible because you don't have the signal amplitude.
noise_mag is not a controlling variable, it's a result.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do we need noise_mag ?

Copy link
Collaborator Author

@lionelkusch lionelkusch Jun 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noise_mag is the noise magnitude.
It corresponds to the final variance of the noise.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but is it needed somewhere in the codebase. If not, I don't think it should be returned.

@lionelkusch lionelkusch requested a review from bthirion June 17, 2025 09:11
Copy link
Collaborator

@bthirion bthirion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more step. Thx !

)
index_support = np.where(beta_true)[0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but my point is: rather use beta_true directly in all subsequent calls. The experience shows that in 99% of the cases you don't need index_support, and moreover, the where API is a bit awkward.
(I used to use where a lot but this is a bad pattern)

)
index_support = np.where(beta_true)[0]
index_no_support = np.where(beta_true - 1)[0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might sonn revisit that and generate non-binary beta. For instance, it typically make sense to have different values to have differnt sensistivities on the variable. One may also want to check that negative beta work as well as positive beta.
For this reason, we should assume that beta_true is an arbitrary scalar.

Signal-to-noise ratio. Controls noise scaling.
sigma_noise : float, default=1.0
Standard deviation of the noise.
rho_noise_time : float, default=0.0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Rename rho_noise_time to rho_serial maybe (or even rho_samples if we change our naming as discussed above).

" to set variable groups. If no grouping is needed,"
" call fit with groups=None"
)
count = 0
for group_id in self.groups.values():
if type(group_id[0]) is int:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this check.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a light check for checking if the number of features of X_test corresponds to the number of features of X_train.
I check that the id feature in the group corresponds to a row in the X_test.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be np.all?
Also, I would not name the variable group_id since self.groups.values() actually contains variable ids, the group id would be self.groups.keys() or the index of the key if keys contains strings.

Finally, if we want to test this for arrays, why not test it for dataframes, making sure that the group members are in the list of column names?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be np.all?
Yes, it should be np.all.

I change the name to index_variables.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally, if we want to test this for arrays, why not test it for dataframes, making sure that the group members are in the list of column names?

For dataframe, it's a bit more complicated because getting the names of the columns in numpy or pandas is different.
I will try to add it.

), "X does not correspond to the fitting data."
count += len(group_id)
if X.shape[1] > count:
warnings.warn("Not all features will has a importance score.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is weird. In which situation would that occur ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the comment of @jpaillard before:

In a scenario where you know that a variable is predictive, but you are not interested in measuring its importance.
For instance, in the diabetes dataset, one may not be interested in the importance of age, but still want to include it in the predictive model to get the importance of other variables conditionally to age. For instance, BloodPressure which is correlated with age.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should consider whether we have the right API for that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@jpaillard jpaillard Jun 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Measuring the importance of variables/groups that do not match the dimension of the data is a useful feature. For instance, to measure the importance of a few variables while controlling (conditioning) for others, for which we don't want to compute and report importance/p-values. Also, if we want to measure the importance of overlapping groups, for instance, in hierarchical clustering.

This feature is naturally supported by the current API, and I don't see any reason to prevent users from using it.
I would suggest removing the warning.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, but this should be documented and showcased in an example, because it is too easy to miss it, and then people will get results that they don't understand.
The alternative is to consider that there are "features to be tested" and nuisance features to condition upon but that are not to be considered.

)
index_support = np.where(beta_true)[0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so support = beta_true != 0 (or np.abs(beta_true) > eps)

@@ -84,6 +87,8 @@ def fit(self, X, y=None, groups=None):
self._groups_ids = [
np.array(ids, dtype=int) for ids in list(self.groups.values())
]
else:
raise ValueError("groups needs to be a dictionnary")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do groups have the same semantics as in sklearn ? If yes, we should use the same type. I belive it is an array.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the definition of groups by the glossary of sklearn, we don't use the same semantics.
The group is sklearn is only related to cross-validation. In our case, the group can be seen as a cluster of features. I didn't any equivalence in the glossary of sklearn.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, skelran groups are samples rather than features. Yet , what is the modtivation to have a dictionary ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motivation is to let the user provide some meaning for the group by adding a specific name to it if he wants.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Question link to tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

How to estimate signal-to-noise ratio? Data generation for tests Remove the usage of global constant
3 participants