TEAM DESTATIS: Repository for work on the UNECE HLG-MOS Synthetic Data Challenge

This was our (Team DESTATIS) repository for work on the United Nations Economic Commission for Europe (UNECE) High-level Group for the Modernisation of Statistical Production and Services (HLG-MOS) Synthetic Data Challenge 2022.

UNECE Logo

Background

The HLG-MOS Synthetic Data Challenge was all about exploring different methods, algorithms, metrics and utilities to create synthetic data. Synthetic data could potentially be an interesting option for national statistical agencies to share data while maintaining public trust.

In order to be useful for certain use cases (e.g. 'release data to the public', 'release to trusted researchers', 'use in education'), the synthetic data needs to preserve certain statistical properties of the original data. Thus, the synthetic data needs to be similar to the original data, but at the same time it has to be different enough to protect privacy.

There is a lot of active research on synthetic data, and new methods for generating synthetic data and evaluating its confidentiality are emerging. The HLG-MOS has created a Synthetic Data Starter Guide to give national statistical offices an introduction to this topic.

Output Example 1

Figure 1: Example result, a correlation plot showing the differences in correlations between a GAN-created synthetic dataset and the original data
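
The kind of comparison behind Figure 1 can be sketched in a few lines: compute the correlation matrix of the original data and of the synthetic data and look at their element-wise difference. The following is only an illustrative sketch (the data frames and variable names are hypothetical), not the exact code behind the figure:

```python
# Illustrative sketch only: element-wise differences between the correlation
# matrices of an original and a synthetic dataset. Data frames and variable
# names are hypothetical, not the exact code behind Figure 1.
import pandas as pd

def correlation_difference(original: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    # Positive entries: the synthetic data overstates a correlation;
    # negative entries: it understates it.
    return synthetic.corr(numeric_only=True) - original.corr(numeric_only=True)

# Hypothetical usage:
# diff = correlation_difference(satgpa_original, satgpa_gan_synthetic)
# print(diff.round(2))
```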

Goal

The goal of the challenge was to create synthetic versions of the provided datasets and afterwards evaluate to what extent we would use this synthetic data for certain use cases. These use cases were 'Releasing microdata to the public', 'Testing analysis', 'Education' and 'Testing technology'.

One objective was to evaluate as many different methods as possible, while still trying to optimize the parameters of each method as well as possible.

Our team managed to create synthetic data with the following methods (a minimal FCS-style sketch follows the list):

  • Fully Conditional Specification (FCS)
  • Generative Adversarial Network (GAN)
  • Probabilistic Graphical Models (Minutemen DP-pgm)
  • Information Preserving Statistical Obfuscation (IPSO)
  • Multivariate non-normal distribution (Simulated)
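
To give a rough idea of what the FCS approach does, here is a minimal sketch of FCS-style sequential synthesis. This is an illustration under simplifying assumptions (purely numeric columns, linear models, a fixed synthesis order), not the implementation we used in the challenge; the column names in the usage example are hypothetical:

```python
# Minimal sketch of FCS-style sequential synthesis (illustrative only, not the
# challenge code): each column is synthesized from a model fitted on the
# original data, conditioning on the columns synthesized before it.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

def synthesize_fcs(original: pd.DataFrame, order: list) -> pd.DataFrame:
    synth = pd.DataFrame(index=original.index)

    # First variable: bootstrap sample from its marginal distribution.
    synth[order[0]] = rng.choice(original[order[0]].to_numpy(), size=len(original))

    # Remaining variables: regress on the already-synthesized columns and add
    # noise drawn from the residuals of the fit on the original data.
    for i, col in enumerate(order[1:], start=1):
        predictors = order[:i]
        model = LinearRegression().fit(original[predictors], original[col])
        residuals = (original[col] - model.predict(original[predictors])).to_numpy()
        synth[col] = model.predict(synth[predictors]) + rng.choice(residuals, size=len(original))

    return synth

# Hypothetical usage with SATGPA-like numeric columns:
# synthetic = synthesize_fcs(satgpa_df, order=["sat_verbal", "sat_math", "gpa"])
```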

The other objective was to do this evaluation, ideally for both of the provided original datasets: one (SATGPA) is more of a toy example, while the other (ACS) is a more complex real-life dataset.

  • SATGPA: SAT (United States standardized university admissions test) and GPA (university grade point average) data, 6 features, 1,000 observations.

  • ACS: Demographic survey data (American Community Survey), 33 features, 1,035,201 observations.

So overall, it was about trying as many methods as possible, while still doing a quality evaluation (in terms of privacy and utility metrics) for each created synthetic dataset.

The final deliverables were:

A short 5-minute summary video, the synthetic datasets, the evaluation reports and an evaluation of the starter guide.

Output Example 2

Figure 2: Example result, a histogram showing the differences in distributions between a GAN-created synthetic dataset and the original data

Team

Our team from the Federal Statistical Office of Germany (Statistisches Bundesamt) consisted of five members from different groups within Destatis. Participating were:

  • Steffen M.
  • Reinhard T.
  • Felix G.
  • Michel R.
  • Hariolf M.

Destatis Logo

Repository Structure

Since the challenge ran in a limited time and we were working in parallel, the GitHub repository might look a little bit untidy. There are plenty of interesting things to find in the repository; here is a quick orientation:

Some larger files (>100 MB) of our repo are unfortunately not included, because of the maximum file size GitHub allows in the free tier.

Results

We ended up in 2nd place on the challenge leaderboard (which mostly reflected how many methods a team successfully used to create and evaluate synthetic datasets). It might be interesting to look at our final overview slides.

Here is also a quick overview of some resulting metrics for the different methods:

Utility Metrics

Figure 3: Some of the utility metrics calculated for the created synthetic versions of the SATGPA dataset

Privacy Metrics

Figure 4: Some of the privacy metrics calculated for the created synthetic versions of the SATGPA dataset

But one key takeaway from this challenge is: one or two metrics are not able to tell the whole story. Some datasets showed, for example, really good utility results according to the pMSE metric, while their utility according to histogram comparisons was terrible. That is why, in order to evaluate each method and each created synthetic dataset, we had to compile quite large reports using several different metrics.
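
For reference, the pMSE idea can be sketched quickly: stack original and synthetic records, train a classifier to predict which side each record comes from, and measure how far the predicted propensities deviate from the expected share of synthetic records (a value near 0 means the classifier cannot tell the two apart). The following is a minimal sketch assuming purely numeric columns and a logistic-regression propensity model, not necessarily the exact setup used in our reports:

```python
# Minimal sketch of the pMSE (propensity score mean squared error) utility
# metric. Assumes both data frames share the same numeric columns; the
# logistic-regression propensity model is one common choice, not necessarily
# the exact setup used in our evaluation reports.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def pmse(original: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    combined = pd.concat([original, synthetic], ignore_index=True)
    # Indicator: 0 = original record, 1 = synthetic record.
    labels = np.concatenate([np.zeros(len(original)), np.ones(len(synthetic))])

    model = LogisticRegression(max_iter=1000).fit(combined, labels)
    propensity = model.predict_proba(combined)[:, 1]

    c = len(synthetic) / len(combined)            # expected share of synthetic rows
    return float(np.mean((propensity - c) ** 2))  # 0 = indistinguishable from original

# Hypothetical usage:
# print(pmse(satgpa_original, satgpa_gan_synthetic))
```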

These reports SATGPA-FCS, SATGPA-GAN, SATGPA-PGM, SATGPA-IPSO, SATGPA-SIM, ACS-FCS, ACS-GAN, ACS-PGM, ACS-IPSO, ACS-SIM (together with the datasets themselves) were the main results.

To end this already quite lengthy README.md file, here is the SATGPA-GAN report as an example:

Executive Summary

(Output example figures from the SATGPA-GAN report)