DPGAN + PATECTGAN: strange behaviour increasing epsilon budget #606

GiuliaGualtieri · 2024-10-11T17:15:10Z

Issue Description

I ran a script that performs a series of operations to test the accuracy and privacy of two of the available data synthesis methods: DPGAN vs PATECTGAN. (Script: run_comparison.py, available in the attached MATERIAL-SMARTNOISE.zip.)

Why is the RandomForest classifier still able to distinguish the private synthesized dataset from the original one when I increase the budget? I expected that, as epsilon tends to infinity, training would approach an "ideal" GAN that reproduces the original distribution of the PUMS dataset, so the classifier's accuracy would decrease because it could no longer tell where the data came from.
Why does this not happen? As you can see in this plot, the accuracy rises to ~95%.

There is a value near epsilon = 5.0 where PATECTGAN reaches 62%. Why does it get worse after that?
I decided to write to you to shed some light on this behaviour, because maybe I am doing something wrong when training the NN or the Random Forest binary classifier.
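For readers without the attachment, the kind of distinguishability test described above can be sketched roughly as follows. This is not the code from run_comparison.py: the function name `distinguishability_accuracy`, the toy Gaussian data, and the use of scikit-learn are assumptions made for illustration. The sketch balances the two classes and scores on a held-out split, so an accuracy near 0.5 means the classifier cannot tell the two sources apart.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def distinguishability_accuracy(real, synth, seed=0):
    """Held-out accuracy of a classifier trained to tell real rows from synthetic rows.

    Hypothetical helper for illustration; inputs are numeric 2-D arrays
    with the same columns.
    """
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    # Use the same number of rows from each source so chance level is 0.5.
    n = min(len(real), len(synth))
    X = np.vstack([real[:n], synth[:n]])
    y = np.concatenate([np.zeros(n), np.ones(n)])  # 0 = real, 1 = synthetic
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_train, y_train)
    # Score on rows the classifier has not seen during training.
    return clf.score(X_test, y_test)


# Toy data (an assumption, not the PUMS dataset).
rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 5))
same = rng.normal(size=(1000, 5))                # same distribution as `real`
shifted = rng.normal(loc=3.0, size=(1000, 5))    # clearly different distribution

acc_same = distinguishability_accuracy(real, same)
acc_shifted = distinguishability_accuracy(real, shifted)
print(f"same distribution: {acc_same:.2f}, shifted distribution: {acc_shifted:.2f}")
```

Running the same function on two disjoint halves of the original data gives a baseline for what "indistinguishable" looks like with a given classifier and sample size.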

Environment

python=3.10

Commands

You can find all the scripts for running and comparing the models in the attached MATERIAL-SMARTNOISE.zip.

Results

You can find the synthetic private data in CSV format in the attached MATERIAL-SMARTNOISE.zip.


joshua-oss · 2024-10-11T20:00:41Z

There is a general issue of "mode collapse" with GAN-based synthesizers on tabular data, where infrequent combinations of attributes get suppressed in the output, and the distribution is biased towards the most frequent categories. This is something that happens even without differential privacy. The CT (conditional tabular) family of GANs attempts to fix this issue by oversampling rare categories, but the normal way of doing this unfortunately violates differential privacy. In cases where categories are fairly uniformly distributed, this might not be a major problem, but in general the GAN synthesizers will have a limited ability to model the data, even if no privacy is applied.
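One rough way to check whether this kind of collapse is present in a synthetic dataset is to compare per-column category frequencies between the real and synthetic data. The sketch below uses only the standard library; the helper names and the toy columns are hypothetical and are not part of SmartNoise. It reports the total variation distance between the two frequency tables and lists the categories that disappear from the synthetic column.

```python
from collections import Counter


def category_frequencies(values):
    """Relative frequency of each category in a column."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(real_values, synth_values):
    """Half the sum of absolute frequency differences (0 = identical, 1 = disjoint)."""
    p = category_frequencies(real_values)
    q = category_frequencies(synth_values)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


def missing_categories(real_values, synth_values):
    """Categories present in the real column but absent from the synthetic one."""
    return sorted(set(real_values) - set(synth_values))


# Toy column (an assumption): the rare category "c" vanishes and "a" is over-represented.
real_col = ["a"] * 70 + ["b"] * 20 + ["c"] * 10
collapsed_col = ["a"] * 95 + ["b"] * 5

tvd = total_variation_distance(real_col, collapsed_col)
missing = missing_categories(real_col, collapsed_col)
print(tvd, missing)
```

If the distance stays large and rare categories stay missing even at high epsilon, the gap is more likely due to the synthesizer's modelling limits than to the privacy noise.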
