
Integrate: answer to question "can observations be made public" into preprocessors and release text #295

Open
6 tasks
raprasad opened this issue Sep 10, 2021 · 5 comments · May be fixed by #667

Comments


raprasad commented Sep 10, 2021

  • 1. create a checklist table of current stats and if/how the computation changes when the # of observations cannot be made public

See google doc: https://docs.google.com/document/d/1xUihcjh4zmfnhG0-2EC-uG-qzpde8WXphRksB0NvHe8/edit#

(Redo steps below after doc discussion)

  • 2. update the StatSpec class (stat_spec.py) to include a variable indicating is_dataset_size_public
  • 3. ^ update the computation chains for existing stats appropriately.
    • e.g. if the is_dataset_size_public == True, update the chain, use a different chain, etc.
    • include tests for each stat. (Check that if the dataset size is private, more epsilon is used, etc.)
  • 4. Integrate into larger workflow. e.g. ValidateReleaseUtil.build_stat_specs()
    • ValidateReleaseUtil.__init__ : add self.is_dataset_size_public = None
    • ValidateReleaseUtil.run_preliminary_steps: set self.is_dataset_size_public to True or False
    • Add function DatasetInfo.is_dataset_size_public()
      • similar to get_dataset_size()
      • except finds answer to the dataset question within DepositorSetupInfo
    • ValidateReleaseUtil.build_stat_specs(): use self.is_dataset_size_public when building the StatSpec objects
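A minimal sketch of what step 2 could look like. The `StatSpec` here is a stand-in (not the project's actual stat_spec.py class), and the 10% budget reservation for a DP count is an invented placeholder parameter:

```python
# Hypothetical sketch: StatSpec and the 10% reservation are placeholders,
# not the project's real implementation.
from dataclasses import dataclass


@dataclass
class StatSpec:
    epsilon: float
    is_dataset_size_public: bool = False

    def epsilon_for_statistic(self) -> float:
        # If the dataset size is public, the full budget goes to the statistic.
        if self.is_dataset_size_public:
            return self.epsilon
        # Otherwise, reserve part of the budget (here 10%) for a DP count,
        # leaving less for the statistic itself.
        return self.epsilon * 0.9


private_spec = StatSpec(epsilon=1.0, is_dataset_size_public=False)
public_spec = StatSpec(epsilon=1.0, is_dataset_size_public=True)
assert private_spec.epsilon_for_statistic() < public_spec.epsilon_for_statistic()
```

This also suggests a shape for the tests in step 3: assert that a private dataset size leaves strictly less epsilon for the statistic than a public one.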
@raprasad raprasad added this to the Create Statistic milestone Sep 10, 2021
@raprasad raprasad changed the title profiler/epsilon question, can observations be made public Integrate: answer to question "can observations be made public" into preprocessors and release text Sep 23, 2021

ecowan commented May 23, 2022

There are two avenues here, each with its own set of logical steps:

Using DP Count:

  1. When the user selects private count = True, then the "create statistic" view should be pre-populated with a row for a DP count, the result of which will be passed into any other statistics that the user selects

  2. If the user selects private count = True and in "create statistic" selects a count, it should override the pre-populated one - we only need this to be calculated once.

Using User Estimation:

  1. One of the views (likely "create statistic") needs a way for the user to specify their best estimate of the count, which is then passed to the backend and used in the computation chains.

  2. If a DP Count is also requested, then we would need to decide which takes precedence.
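The precedence question in the two avenues above could be sketched as a small selector. The function name and the rule itself (a requested DP count wins over a user estimate) are assumptions for illustration, not a settled decision:

```python
# Sketch of a possible precedence rule; names and the rule are assumptions.
from typing import Optional


def count_source(dp_count_requested: bool, user_estimate: Optional[int]) -> str:
    if dp_count_requested:
        return "dp_count"            # an explicitly requested DP count takes precedence
    if user_estimate is not None:
        return "user_estimate"       # fall back to the user's best estimate
    return "prepopulated_dp_count"   # otherwise use the auto-added DP count row


assert count_source(True, 1000) == "dp_count"
assert count_source(False, 1000) == "user_estimate"
```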

@raprasad @ekraffmiller

Thanks to @Shoeboxam for the discussion


ecowan commented May 23, 2022

Needed for computing DP counts:

  1. Select any one of the columns in the data set
  2. Set a parameter (epsilon/10, etc.) that determines how much budget should be used to calculate the count estimate
  3. Construct a new class with similar functionality to ValidateReleaseUtil that can return a DP count only
  4. The result of this class needs to be passed into ValidateReleaseUtil to be used in the resize step of each statistic
  5. ValidateReleaseUtil also needs to lower the maximum_epsilon based on how much was used by the DP count
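A hedged sketch of steps 2–5. `DPCountUtil`, the inlined Laplace noise, and the epsilon/10 split are placeholders (real code would use a DP library and the project's actual classes):

```python
# Illustrative only: DPCountUtil, the noise mechanism, and the epsilon
# split are invented stand-ins for the steps listed above.
import random


def laplace_count(n_rows: int, epsilon: float) -> int:
    # Laplace(1/epsilon) noise via a difference of exponentials.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return max(0, round(n_rows + noise))


class DPCountUtil:
    COUNT_FRACTION = 0.1  # step 2: e.g. spend epsilon/10 on the count estimate

    def __init__(self, n_rows: int, max_epsilon: float):
        self.count_epsilon = max_epsilon * self.COUNT_FRACTION
        # step 5: lower the budget remaining for the other statistics
        self.remaining_epsilon = max_epsilon - self.count_epsilon
        # steps 1 and 3: a DP count over the rows of any one column
        self.dp_count = laplace_count(n_rows, self.count_epsilon)


util = DPCountUtil(n_rows=500, max_epsilon=1.0)
assert abs(util.remaining_epsilon - 0.9) < 1e-9
```

Step 4 would then pass `util.dp_count` into the resize step of each statistic, and step 5 is the `remaining_epsilon` bookkeeping above.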


raprasad commented May 25, 2022

An old slide. We're not getting user input--yet.

This ticket is for implementing the green box labeled: "Use privacy budget to capture size"

(Screenshot: 2022-0525-iqss-dataflow, Google Slides)


ecowan commented Jun 1, 2022

@raprasad Why don't we approach this incrementally and first build a feature where the user has to answer "yes"? This way, we can first develop the part of the code that takes the estimate from the front end and passes it into the process. Once this is merged, we can add functionality for the case where they say "no".


ecowan commented Jun 1, 2022

Another option is to create two analysis objects, one for the DP count and one for the rest, and split the budget between them. This way we could reuse the existing ValidateReleaseUtil class to compute what we need, rather than creating new classes to compute the DP count separately.

The workflow could look like this:

  1. User selects "count is private"
  2. Make two API calls to create new analyses, and link them to each other
  3. When dp count analysis completes, save the dp count to the analysis object
  4. When the second analysis runs, look to the linked analysis object and take the dp count from it
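The four-step workflow above could be sketched as follows. `Analysis`, its fields, and the 10/90 budget split are invented for illustration, not the project's actual models or API:

```python
# Minimal sketch of the linked-analysis workflow; all names are assumptions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Analysis:
    epsilon: float
    linked: Optional["Analysis"] = None
    dp_count: Optional[int] = None


def run_linked_analyses(total_epsilon: float, noisy_count: int) -> Analysis:
    # steps 1-2: create two analyses and link them to each other
    count_analysis = Analysis(epsilon=total_epsilon * 0.1)
    main_analysis = Analysis(epsilon=total_epsilon * 0.9, linked=count_analysis)
    # step 3: when the DP count analysis completes, save the count on it
    count_analysis.dp_count = noisy_count
    # step 4: the second analysis reads the count through the link
    return main_analysis


main = run_linked_analyses(total_epsilon=1.0, noisy_count=512)
assert main.linked is not None and main.linked.dp_count == 512
```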

@raprasad raprasad assigned raprasad and unassigned ecowan Jun 23, 2022
raprasad added a commit that referenced this issue Jul 11, 2022
@raprasad raprasad modified the milestones: Create Statistic, Create stats fixes Jul 11, 2022
@raprasad raprasad linked a pull request Jul 22, 2022 that will close this issue
raprasad added a commit that referenced this issue Jul 25, 2022
raprasad added a commit that referenced this issue Jul 25, 2022
raprasad added a commit that referenced this issue Jul 25, 2022
raprasad added a commit that referenced this issue Jul 26, 2022
raprasad added a commit that referenced this issue Aug 2, 2022
@raprasad raprasad removed this from the Create stats fixes milestone Aug 3, 2022