DPGAN + PATECTGAN: strange behaviour increasing epsilon budget #606

Open
GiuliaGualtieri opened this issue Oct 11, 2024 · 1 comment

@GiuliaGualtieri

Issue Description

I ran a script that performs a series of operations to test the accuracy and privacy of two of the available data synthesis methods: DPGAN and PATECTGAN (script: run_comparison.py, available in the attached MATERIAL-SMARTNOISE.zip).

Why, when I increase the privacy budget, is the RandomForest classifier still able to distinguish the private synthesized dataset from the original one? I expected that, as epsilon goes to infinity, I would in effect get the “ideal” GAN, one that reproduces the original distribution of the PUMS dataset perfectly, so the classifier's accuracy should decrease because it can no longer tell where the data came from.
Why does this not happen? As you can see in this plot, the accuracy rises to ~95%.
[Plot: Accuracy_DPGAN_PATEGAN_log(epsilon)]
There is a value near epsilon = 5.0 where PATECTGAN reaches 62%. Why does it get worse after that?
So I decided to write to you to shed some light on this behaviour, because maybe I am doing something wrong when training the neural networks or the Random Forest binary classifier.
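
The expectation above can be made precise with a short worked formula. This is a sketch under stated assumptions, not something taken from the attached script: it assumes the classifier sees equal numbers of real and synthetic rows and is scored on held-out data. The symbols $P$ (real data distribution), $Q$ (synthetic data distribution) and $\mathrm{acc}^{*}$ (best achievable accuracy of any classifier) are introduced here for illustration.

```latex
\mathrm{acc}^{*} \;=\; \frac{1}{2} + \frac{1}{2}\,\mathrm{TV}(P, Q),
\qquad
\mathrm{TV}(P, Q) \;=\; \frac{1}{2} \sum_{x} \bigl| P(x) - Q(x) \bigr|
```

Under these assumptions, accuracy near 50% is only possible when $\mathrm{TV}(P, Q)$ is near 0, that is, when the synthesizer reproduces the real distribution almost exactly. Conversely, a held-out accuracy of about 95% would suggest that the two distributions barely overlap, whatever the value of epsilon.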

Environment

  • python=3.10

Commands

You can find all the scripts for running and comparing the models in the attached MATERIAL-SMARTNOISE.zip.

Results

You can find the synthetic private data in CSV format in the attached MATERIAL-SMARTNOISE.zip.

@joshua-oss
Contributor

There is a general issue of "mode collapse" with GAN-based synthesizers on tabular data, where infrequent combinations of attributes get suppressed in the output and the distribution is biased towards the most frequent categories. This happens even without differential privacy. The CT (conditional tabular) family of GANs attempts to fix this issue by oversampling rare categories, but the normal way of doing this unfortunately violates differential privacy. In cases where categories are fairly uniformly distributed, this might not be a major problem, but in general the GAN synthesizers will have a limited ability to model the data, even if no privacy is applied.
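
One rough way to check whether mode collapse is affecting a given categorical column is to compare category frequencies in the real and synthetic data. The sketch below uses only the Python standard library. The helper names (`category_frequencies`, `total_variation`, `missing_categories`) and the toy data are hypothetical and are not part of SmartNoise or of the attached script.

```python
from collections import Counter


def category_frequencies(values):
    """Return {category: relative frequency} for a list of categorical values."""
    counts = Counter(values)
    total = len(values)
    return {cat: n / total for cat, n in counts.items()}


def total_variation(real_values, synth_values):
    """Total variation distance between two empirical categorical distributions."""
    p = category_frequencies(real_values)
    q = category_frequencies(synth_values)
    cats = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cats)


def missing_categories(real_values, synth_values):
    """Categories present in the real column but absent from the synthetic one."""
    return sorted(set(real_values) - set(synth_values))


# Toy data (hypothetical, not from PUMS): "c" is rare in the real column.
real = ["a"] * 70 + ["b"] * 25 + ["c"] * 5
# A collapsed synthesizer over-produces the majority category and drops "c".
synth = ["a"] * 90 + ["b"] * 10

tv = total_variation(real, synth)
dropped = missing_categories(real, synth)
print(f"TV distance: {tv:.2f}, dropped categories: {dropped}")
```

If a column shows a large distance or dropped categories, a classifier can use that column alone to separate real rows from synthetic ones, which would be consistent with high distinguishability at large epsilon. Applying the same check to pairs of columns would probe the suppressed attribute combinations mentioned above.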
