
The fixed value for feel_zeroes in get_binned_data may lead to deviation in some cases #334

Open
chi2liu opened this issue Sep 26, 2022 · 6 comments
Labels
enhancement (New feature or request) · good first issue (Good for newcomers) · hacktoberfest (Accepted contributions will count towards your hacktoberfest PRs)

Comments


chi2liu commented Sep 26, 2022

In the implementation of get_binned_data:

np.place(reference_percents, reference_percents == 0, 0.0001)

By default, 0.0001 is used to fill bins whose percent is 0. This works when the minimum non-zero percent is greater than 0.0001. But when the minimum percent in the data is less than or equal to 0.0001, say 0.00001, the original zero bins become 0.0001 after filling, which is larger than the existing minimum percent (0.00001), so the data distribution is distorted after filling zeros.

I think the fill value used here should be set dynamically from the percentages in the data rather than hard-coded. Exactly how to set it needs further discussion; for example, it could be one tenth of the minimum non-zero percent, or some other value, but it must be guaranteed to be smaller than the minimum percent, otherwise the result will still be wrong.
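
For illustration, a minimal sketch of the dynamic approach (the helper name and the one-tenth factor are just my example, not existing Evidently code):

import numpy as np

def fill_zero_bins(percents: np.ndarray) -> np.ndarray:
    # replace empty bins with a value guaranteed to stay below the smallest non-zero percent
    percents = percents.astype(float)
    nonzero = percents[percents != 0]
    fill_value = nonzero.min() / 10 if nonzero.size else 0.0001
    np.place(percents, percents == 0, fill_value)
    return percents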

I found this problem when I tried to calculate the Kullback-Leibler divergence drift score on my data: the minimum percent in my data was 0.00001, smaller than the default fill value, which resulted in an incorrect Kullback-Leibler divergence drift score. For example, the default calculation gives a drift score of 0.41 on my data; after I change the fill value to one tenth of the minimum percent, the calculated drift score is 1.9.
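
To make the effect concrete, here is a small reproduction with made-up percentages (not my actual data; the KL direction and the use of scipy.stats.entropy are only for this sketch, not necessarily how the library computes the score):

import numpy as np
from scipy.stats import entropy

reference_percents = np.array([0.9, 0.09999, 0.00001, 0.0])
current_percents = np.array([0.7, 0.1, 0.0, 0.2])

def kl_after_fill(ref, cur, fill_value):
    ref, cur = ref.copy(), cur.copy()
    np.place(ref, ref == 0, fill_value)
    np.place(cur, cur == 0, fill_value)
    return entropy(cur, ref)  # D_KL(current || reference)

print(kl_after_fill(reference_percents, current_percents, 0.0001))  # fixed fill value
print(kl_after_fill(reference_percents, current_percents, 0.00001 / 10))  # one tenth of the minimum percent

The second call uses a smaller fill value for the empty reference bin, so the log-ratio for that bin grows and the reported drift score increases, which matches what I observed.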

emeli-dral added the enhancement (New feature or request) label Oct 3, 2022
@emeli-dral
Contributor

Hi @chi2liu,

Thanks for reporting it! We will not be able to address it immediately as the ongoing API update is currently a priority, but I added it to the list of feature requests we regularly review. Will get there!

We also welcome a contribution if you want to fix it!

emeli-dral added the hacktoberfest (Accepted contributions will count towards your hacktoberfest PRs) and good first issue (Good for newcomers) labels Oct 6, 2022
AntonisCSt added three commits to AntonisCSt/evidently that referenced this issue Oct 9, 2022
@AntonisCSt
Contributor

Wow, I managed to mess up a bit with this issue by adding many different references in my local commits!
Sorry!!
(new to contributing here)

I opened a pull request for this with a temporary change.

I added an if statement:

np.place(reference_percents, reference_percents == 0, min(reference_percents[reference_percents != 0]) / 10**6 if min(reference_percents[reference_percents != 0]) <= 0.0001 else 0.0001)
np.place(current_percents, current_percents == 0, min(current_percents[current_percents != 0]) / 10**6 if min(current_percents[current_percents != 0]) <= 0.0001 else 0.0001)

So, I basically did what you mentioned @chi2liu, but instead of dividing by ten I divided by 10^6. This is just to keep the fill as close as possible to zero (the actual value).

For your case (and other similar ones) this will increase the Kullback-Leibler divergence drift score.
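
For readability, the same logic written out step by step (equivalent to the two lines above, not the exact code from the PR):

# fill value: one millionth of the smallest non-zero percent when that minimum is already tiny,
# otherwise keep the original constant 0.0001
min_ref = reference_percents[reference_percents != 0].min()
ref_fill = min_ref / 10**6 if min_ref <= 0.0001 else 0.0001
np.place(reference_percents, reference_percents == 0, ref_fill)

min_cur = current_percents[current_percents != 0].min()
cur_fill = min_cur / 10**6 if min_cur <= 0.0001 else 0.0001
np.place(current_percents, current_percents == 0, cur_fill)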


chi2liu commented Oct 10, 2022


@AntonisCSt I don't think it's meaningful to use 0.0001 as a threshold. Maybe we should always use the minimum non-zero value divided by 10^6, since, as you mentioned, we want the fill to be as close to zero (the actual value) as possible. Why make an exception for values greater than 0.0001?
I don't know why it was set to 0.0001 by default in the first place; perhaps there were special considerations. @emeli-dral
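
In other words, the unconditional variant I have in mind would be something like this (just a sketch):

# always scale the fill value from the smallest non-zero percent, no 0.0001 threshold
min_ref = reference_percents[reference_percents != 0].min()
np.place(reference_percents, reference_percents == 0, min_ref / 10**6)

min_cur = current_percents[current_percents != 0].min()
np.place(current_percents, current_percents == 0, min_cur / 10**6)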


AntonisCSt commented Oct 10, 2022

@chi2liu yes, good point. That definitely needs a discussion. I kept the threshold so the original logic of the author would still be there, but I agree.

I will add an argument to my fix explaining why 10^6 and not 10^9. (hehe)

emeli-dral pushed a commit that referenced this issue Oct 26, 2022
* fix for issue #334 feel zeroes of stattests utils

* black fix

boemer00 commented Sep 3, 2024

Hi @emeli-dral, hope you are well.
I have submitted a pull request for this issue.
It should be sorted now, and I also suggested updating the parameter name from feel_zeroes to fill_zeroes for clarity.

thanks!


AntonisCSt commented Sep 3, 2024

Definitely makes more sense now ^^
