
The fixed value for feel_zeroes in get_binned_data may lead to deviation in some cases #334

Open
chi2liu opened this issue Sep 26, 2022 · 6 comments
Labels
enhancement (New feature or request) · good first issue (Good for newcomers) · hacktoberfest (Accepted contributions will count towards your hacktoberfest PRs)

Comments


chi2liu commented Sep 26, 2022

In the implementation of get_binned_data:

np.place(reference_percents, reference_percents == 0, 0.0001)

By default, 0.0001 is used to fill bins whose percent is 0. This works when the minimum non-zero percent is greater than 0.0001. But when the minimum percent in the data is less than or equal to 0.0001, say 0.00001, the original zero bins become 0.0001 after filling, which is larger than the existing minimum percent (0.00001), so the data distribution is distorted after filling zeros.

I think the fill value used here should be set dynamically from the percentages in the data rather than hard-coded. Exactly how to set it needs further discussion; for example, it could be one tenth of the minimum non-zero percent, or some other value, but it must be guaranteed to be smaller than the minimum percent, otherwise the result will still be wrong.
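
For illustration, a minimal sketch of the dynamic approach (the helper name and the one-tenth factor are just my example, not existing Evidently code):

import numpy as np

def fill_zero_bins(percents: np.ndarray) -> np.ndarray:
    # replace empty bins with a value guaranteed to stay below the smallest non-zero percent
    percents = percents.astype(float)
    nonzero = percents[percents != 0]
    fill_value = nonzero.min() / 10 if nonzero.size else 0.0001
    np.place(percents, percents == 0, fill_value)
    return percents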

I found this problem when I tried to calculate the Kullback-Leibler divergence drift score on my data: the minimum percent in my data was 0.00001, smaller than the default fill value, which resulted in an incorrect Kullback-Leibler divergence drift score. For example, the default calculation gives a drift score of 0.41 on my data; after I change the fill value to one tenth of the minimum percent, the calculated drift score is 1.9.
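
To make the effect concrete, here is a small reproduction with made-up percentages (not my actual data; the KL direction and the use of scipy.stats.entropy are only for this sketch, not necessarily how the library computes the score):

import numpy as np
from scipy.stats import entropy

reference_percents = np.array([0.9, 0.09999, 0.00001, 0.0])
current_percents = np.array([0.7, 0.1, 0.0, 0.2])

def kl_after_fill(ref, cur, fill_value):
    ref, cur = ref.copy(), cur.copy()
    np.place(ref, ref == 0, fill_value)
    np.place(cur, cur == 0, fill_value)
    return entropy(cur, ref)  # D_KL(current || reference)

print(kl_after_fill(reference_percents, current_percents, 0.0001))  # fixed fill value
print(kl_after_fill(reference_percents, current_percents, 0.00001 / 10))  # one tenth of the minimum percent

The second call uses a smaller fill value for the empty reference bin, so the log-ratio for that bin grows and the reported drift score increases, which matches what I observed.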

emeli-dral added the enhancement (New feature or request) label Oct 3, 2022
@emeli-dral
Contributor

Hi @chi2liu,

Thanks for reporting it! We will not be able to address it immediately as the ongoing API update is currently a priority, but I added it to the list of feature requests we regularly review. Will get there!

We also welcome a contribution if you want to fix it!

emeli-dral added the hacktoberfest (Accepted contributions will count towards your hacktoberfest PRs) and good first issue (Good for newcomers) labels Oct 6, 2022
AntonisCSt added three commits to AntonisCSt/evidently that referenced this issue Oct 9, 2022
@AntonisCSt
Contributor

Wow, I managed to mess up a bit with this issue by adding many different references in my local commits!
Sorry!!
(new to contributing here)

I opened a pull request for this with a temporary change.

I added an if statement:

np.place(reference_percents, reference_percents == 0, min(reference_percents[reference_percents != 0]) / 10**6 if min(reference_percents[reference_percents != 0]) <= 0.0001 else 0.0001)
np.place(current_percents, current_percents == 0, min(current_percents[current_percents != 0]) / 10**6 if min(current_percents[current_percents != 0]) <= 0.0001 else 0.0001)

So, I basically did what you mentioned @chi2liu, but instead of dividing by ten I divided by 10^6. This is just to keep the fill as close as possible to zero (the actual value).

For your case (and other similar ones) this will increase the Kullback-Leibler divergence drift score.
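
For readability, the same logic written out step by step (equivalent to the two lines above, not the exact code from the PR):

# fill value: one millionth of the smallest non-zero percent when that minimum is already tiny,
# otherwise keep the original constant 0.0001
min_ref = reference_percents[reference_percents != 0].min()
ref_fill = min_ref / 10**6 if min_ref <= 0.0001 else 0.0001
np.place(reference_percents, reference_percents == 0, ref_fill)

min_cur = current_percents[current_percents != 0].min()
cur_fill = min_cur / 10**6 if min_cur <= 0.0001 else 0.0001
np.place(current_percents, current_percents == 0, cur_fill)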


chi2liu commented Oct 10, 2022


@AntonisCSt I don't think it's meaningful to use 0.0001 as a threshold. Maybe we should always use the minimum non-zero value divided by 10^6, since, as you mentioned, we want the fill to be as close to zero (the actual value) as possible. Why make an exception for values greater than 0.0001?
I don't know why it was set to 0.0001 by default in the first place; perhaps there were special considerations. @emeli-dral
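
In other words, the unconditional variant I have in mind would be something like this (just a sketch):

# always scale the fill value from the smallest non-zero percent, no 0.0001 threshold
min_ref = reference_percents[reference_percents != 0].min()
np.place(reference_percents, reference_percents == 0, min_ref / 10**6)

min_cur = current_percents[current_percents != 0].min()
np.place(current_percents, current_percents == 0, min_cur / 10**6)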


AntonisCSt commented Oct 10, 2022

@chi2liu yes, good point. That definitely needs a discussion. I kept the threshold so the original logic of the author would still be there, but I agree.

I will add an argument to my fix explaining why 10^6 and not 10^9. (hehe)

emeli-dral pushed a commit that referenced this issue Oct 26, 2022
* fix for issue #334 feel zeroes of stattests utils

* black fix

boemer00 commented Sep 3, 2024

Hi @emeli-dral, hope you are well.
I have submitted a pull request for this issue.
It should be sorted now, and I also suggested updating the parameter name from feel_zeroes to fill_zeroes for clarity.

thanks!


AntonisCSt commented Sep 3, 2024

Definitely makes more sense now ^^
