Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate the logic of applying CSTest #163

Open
npatki opened this issue Jul 14, 2022 · 2 comments
Open

Investigate the logic of applying CSTest #163

npatki opened this issue Jul 14, 2022 · 2 comments
Labels
data:single-table Related to tabular datasets feature request Request for a new feature

Comments

@npatki
Copy link
Contributor

npatki commented Jul 14, 2022

Problem Description

As mentioned in #70, the current implementation of CSTest might not be entirely correct. Before applying CSTest, we currently normalize the frequencies of each category but in reality, scipy expects larger number (it mentions they should be at least 5).

Expected behavior

I'm filing this feature request to track the logic and utility of CSTest. If this is something we want to continue supporting, how can we change it to make it more accurate?

Additional context

Note that in #142, we are adding the TVComplement metric, a categorical similarity metric based on the Total Variation Distance. So even if we deprecate CSTest, there will still be another measure for categorical similarity.

@npatki npatki added feature request Request for a new feature data:single-table Related to tabular datasets labels Jul 14, 2022
@MartinKratky
Copy link

Also noticed this behaviour when getting very high p-values even for synthetic data that did not seem to fit the real very well.

Furthermore, the scipy.stats.chisquare method is a one-way test and expects the number of observations to match:
"the sum of the observed and expected frequencies must be the same for the test to be valid"...

Following https://online.stat.psu.edu/stat415/lesson/17/17.1 the test statistic should be modified to account for that.

Why do you plan to deprecate CSTest?

@npatki
Copy link
Contributor Author

npatki commented Sep 5, 2022

Hi @MartinKratky, yes the frequencies must match and scipy recommends there be >13 data points. Currently, SDMetrics normalizes the data to 1, which is invalid for the test -- so I would not rely on the p-value that is currently returned.

Why do you plan to deprecate CSTest?

We have not yet made a final decision on this. However, we have added TVComplement metric if you'd like to play around with it. This calculates similarity using the Total Variation Distance.

from sdmetrics.single_column import TVComplement

TVComplement.compute(
    real_data['discrete_column_name'],
    synthetic_data['discrete_column_name']
)

Most synthetic data users are not interested in hypothesis testing (deciding whether the synthetic and real data are different); rather, they are interested in quantifying the difference. We find that a distance metric, not a p-value, is more suitable for this goal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data:single-table Related to tabular datasets feature request Request for a new feature
Projects
None yet
Development

No branches or pull requests

2 participants