Find nice datasets and use cases for anonymizing with PipelineDP #34

dvadym · 2021-06-02T13:39:21Z

This issue is for tracking ideas for datasets and use cases for PipelineDP.

Having datasets/use cases would be helpful for:

* showing how to work with PipelineDP (maybe in Colab)
* testing new features
* analyzing the utility or speed of PipelineDP
* learning new ways of applying DP (in case the use cases differ from known ones)

Some requirements on datasets:

* they should contain tabular data
* there should be some user data and (ideally) a way to quantify each user's contributions
* there should be interesting aggregated metrics about these datasets
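As a rough illustration of the second requirement, here is a minimal sketch of how one might profile per-user contributions in a candidate dataset before trying it with PipelineDP. The function name `contribution_profile`, the column names, and the toy rows are all hypothetical; the two maxima it reports are roughly the quantities that contribution bounds (such as `max_partitions_contributed` and `max_contributions_per_partition` in PipelineDP's `AggregateParams`) would be chosen against.

```python
from collections import defaultdict


def contribution_profile(rows, user_key, partition_key):
    """Summarize how much each user contributes to a tabular dataset.

    rows: iterable of dicts (one per record).
    user_key / partition_key: column names (hypothetical, dataset-specific).
    """
    partitions_per_user = defaultdict(set)
    rows_per_user_partition = defaultdict(int)
    for row in rows:
        user, partition = row[user_key], row[partition_key]
        partitions_per_user[user].add(partition)
        rows_per_user_partition[(user, partition)] += 1
    return {
        # Number of distinct users in the dataset.
        "users": len(partitions_per_user),
        # Largest number of distinct partitions any single user touches.
        "max_partitions_per_user": max(
            (len(p) for p in partitions_per_user.values()), default=0),
        # Largest number of rows one user adds to a single partition.
        "max_rows_per_user_partition": max(
            rows_per_user_partition.values(), default=0),
    }


# Toy data, made up purely for illustration.
visits = [
    {"user_id": "u1", "day": "Mon", "spent": 10},
    {"user_id": "u1", "day": "Mon", "spent": 5},
    {"user_id": "u1", "day": "Tue", "spent": 7},
    {"user_id": "u2", "day": "Mon", "spent": 3},
]
profile = contribution_profile(visits, "user_id", "day")
print(profile)
```

A dataset where these maxima are very large relative to the typical user may still be usable, but it will likely need tighter contribution bounding, which is itself an interesting case to test.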

Please add suggestions in comments.

dasmdasm · 2021-06-02T17:38:39Z

NIST ran a competition last year with the goal of producing differentially private statistics based on a dataset of emergency services calls. As part of that they released a dataset of 911 calls in Baltimore. That could make a decent example.

emmcauley · 2022-04-28T16:36:42Z

Is this ticket still of interest? If so, I may have some ideas for biomedical research datasets. Examples include:

* Cancer Imaging Archive -- a collection of imaging, genomic, and clinical data, some of which are publicly available.
* cBioPortal -- a collection of tabular clinical + genomic data

Additionally, GeCo is a tool to create and corrupt synthetic data.

The first two would be valuable because there are plenty of academic papers whose results we may be able to test/validate +/- DP.

dvadym · 2022-04-30T16:30:43Z

Thank you @emmcauley, those datasets look very interesting!

I know nothing about using DP in medical research; it would be interesting to learn more (and maybe try to use PipelineDP). Do you know of any links (papers, presentations, books, videos, etc.) that would help in understanding this area?

emmcauley · 2022-05-03T21:08:25Z

I'm interested in making this topic more approachable to a broader audience and I'm happy to collate some additional resources here (it will take me a few days or so). In the meantime, do you have specific questions I can help address?

dvadym · 2022-05-22T07:30:34Z

For me personally, I'd be interested in some use cases of DP in medical research. Do you know of any papers about that? It would help to understand what methods are used, and whether we can support them in PipelineDP.

Yeah, it would be great to make it more approachable. I'm happy to participate in this.

aaallami · 2022-08-19T09:30:16Z

Is this ticket still of interest? If so, I may have some ideas for biomedical research datasets. Examples include:
* [Cancer Imaging Archive](https://www.cancerimagingarchive.net/collections/) -- a collection of imaging, genomic, and clinical data, some of which are publicly available.
* [cBioPortal](https://www.cbioportal.org/datasets) -- a collection of tabular clinical + genomic data

Additionally, GeCo is a tool to create and corrupt synthetic data.

The first two would be valuable because there are plenty of academic papers whose results we may be able to test/validate +/- DP.

@emmcauley I am working on secure cancer prediction protocols using data mining techniques such as K-means and SVM. The problem domain is secure multiparty computation (MPC), where multiple parties own the data and would like to analyze it without revealing their inputs. However, I am open to shifting to DP if it is applicable, since I believe it offers better performance than MPC. Nevertheless, this kind of problem is more about accuracy than performance. Therefore, the question becomes: can DP still offer higher accuracy than MPC? I am in the early stages of this project. If you have any feedback on my direction for this project, please let me know.

dvadym added the labels Type: Discussion 🔈 and Type: Research 🔬 on Jun 2, 2021
