Investigate the logic of applying `CSTest` #163

npatki · 2022-07-14T20:52:50Z

Problem Description

As mentioned in #70, the current implementation of CSTest might not be entirely correct. Before applying CSTest, we currently normalize the frequencies of each category but in reality, scipy expects larger number (it mentions they should be at least 5).

Expected behavior

I'm filing this feature request to track the logic and utility of CSTest. If this is something we want to continue supporting, how can we change it to make it more accurate?

Additional context

Note that in #142, we are adding the TVComplement metric, a categorical similarity metric based on the Total Variation Distance. So even if we deprecate CSTest, there will still be another measure for categorical similarity.

The text was updated successfully, but these errors were encountered:

MartinKratky · 2022-09-04T19:32:55Z

Also noticed this behaviour when getting very high p-values even for synthetic data that did not seem to fit the real very well.

Furthermore, the scipy.stats.chisquare method is a one-way test and expects the number of observations to match:
"the sum of the observed and expected frequencies must be the same for the test to be valid"...

Following https://online.stat.psu.edu/stat415/lesson/17/17.1 the test statistic should be modified to account for that.

Why do you plan to deprecate CSTest?

npatki · 2022-09-05T18:01:01Z

Hi @MartinKratky, yes the frequencies must match and scipy recommends there be >13 data points. Currently, SDMetrics normalizes the data to 1, which is invalid for the test -- so I would not rely on the p-value that is currently returned.

Why do you plan to deprecate CSTest?

We have not yet made a final decision on this. However, we have added TVComplement metric if you'd like to play around with it. This calculates similarity using the Total Variation Distance.

from sdmetrics.single_column import TVComplement

TVComplement.compute(
    real_data['discrete_column_name'],
    synthetic_data['discrete_column_name']
)

Most synthetic data users are not interested in hypothesis testing (deciding whether the synthetic and real data are different); rather, they are interested in quantifying the difference. We find that a distance metric, not a p-value, is more suitable for this goal.

npatki added feature request Request for a new feature data:single-table Related to tabular datasets labels Jul 14, 2022

npatki mentioned this issue Jul 14, 2022

Why are the frequencies normalized in CSTest? #70

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate the logic of applying `CSTest` #163

Investigate the logic of applying `CSTest` #163

npatki commented Jul 14, 2022

MartinKratky commented Sep 4, 2022

npatki commented Sep 5, 2022

Investigate the logic of applying CSTest #163

Investigate the logic of applying CSTest #163

Comments

npatki commented Jul 14, 2022

Problem Description

Expected behavior

Additional context

MartinKratky commented Sep 4, 2022

npatki commented Sep 5, 2022

Investigate the logic of applying `CSTest` #163

Investigate the logic of applying `CSTest` #163