
Re-adding accuracy estimation to queries #502

Open
FishmanL opened this issue Nov 9, 2022 · 1 comment

@FishmanL (Contributor) commented Nov 9, 2022

What would it take to get accuracy estimation back for more complex agg like avg, assuming we're just using laplace and gaussian?
(it's pretty urgent for us)

@joshua-oss (Contributor) commented:

The analytic formula for AVG accuracy ended up not being very useful, so we now simulate to get the accuracies. The pre_aggregated option was added to facilitate this; see the documentation.

Something to note is that you can get the simulated accuracies "for free" if you simulate on top of the noisy answer, since all future runs count as post-processing. For example, the code below uses only an epsilon of 1.0 and gives 95% bounds for AVG(income) and AVG(age):

from copy import deepcopy
import pandas as pd
import tqdm
from snsql import Privacy, from_connection

pums = pd.read_csv("../datasets/PUMS.csv")
yaml = "../datasets/PUMS.yaml"
privacy = Privacy(epsilon=1.0, delta=1e-5)
conn = from_connection(pums, metadata=yaml, privacy=privacy)

query = "SELECT AVG(age), AVG(income) FROM PUMS.PUMS"

# Parse the query and keep a copy of the original FROM clause
query_ast = conn.parse_query_string(query)
from_clause = deepcopy(query_ast.source)

# Rewrite the query to get the exact-aggregate subquery (COUNT and SUM)
subquery_ast, _ = conn._rewrite(query)

# Restore the original FROM clause so the subquery runs against the source table
subquery_ast.source = from_clause
subquery = str(subquery_ast)

print(f"Original query: {query}")
print(f"Subquery: {subquery}")

# Execute the subquery once through the private connection; this is the only time the data is queried
inner = conn.execute(subquery)
# inner = conn.reader.execute(subquery)  # exact (non-private) results instead of noisy ones

print(f"Inner result: \n{inner}")

age = []
income = []
n_runs = 1000
# Re-run the outer query against the noisy pre-aggregated values; these runs are
# post-processing of the noisy inner result, so they add no extra privacy spend
for _ in tqdm.trange(n_runs):
    noisy = conn._execute_ast(query_ast, pre_aggregated=inner)
    age.append(noisy[1][0])
    income.append(noisy[1][1])

alpha = 0.05

# Empirical 2.5% and 97.5% quantiles of the simulated values give the 95% bounds
age = sorted(age)
income = sorted(income)
age_lower = age[int(n_runs * alpha / 2)]
age_upper = age[int(n_runs * (1 - alpha / 2))]
income_lower = income[int(n_runs * alpha / 2)]
income_upper = income[int(n_runs * (1 - alpha / 2))]

# The reported AVG is the noisy SUM divided by the noisy COUNT from the inner result
income_reported = inner[1][2] / inner[1][0]
age_reported = inner[1][1] / inner[1][0]

print(f"Reported age: {age_reported} will be in [{age_lower}, {age_upper}]")
print(f"Reported income: {income_reported} will be in [{income_lower}, {income_upper}]")

This is a more advanced version of the example in the docs. In the docs, the caller needs to know that AVG is composed of SUM and COUNT; here we use the query rewriter to extract that information, so you could pass in any other formula and still automatically obtain the right subquery. The commented-out conn.reader.execute line could be uncommented to skip differential privacy on the inner query and get exact results from the database, but then you would be paying privacy spend on every simulation run. Because the line above it runs through the differentially private connection, the values obtained are noisy instead, which means you can report them, and you can simulate "as if" they are the real values to get an estimate of the accuracy impact.
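
For comparison, here is a rough sketch of the simpler docs-style approach for a single AVG, where the caller hand-writes the COUNT/SUM decomposition instead of asking the rewriter for it. This is only an illustration, not code from the docs: it assumes the rewritten AVG(age) query expects exactly a COUNT column followed by a SUM column (matching the column order of inner above), which is precisely the detail the rewriter-based version saves you from guessing.

# Docs-style alternative: hand-written decomposition of AVG(age) into COUNT and SUM.
# NOTE: the column order is an assumption here; deriving the subquery from the
# rewriter, as above, is the safer route.
manual_subquery = "SELECT COUNT(age), SUM(age) FROM PUMS.PUMS"
inner_manual = conn.execute(manual_subquery)   # one noisy run; the only privacy spend

avg_ast = conn.parse_query_string("SELECT AVG(age) FROM PUMS.PUMS")
samples = sorted(
    conn._execute_ast(avg_ast, pre_aggregated=inner_manual)[1][0]
    for _ in range(1000)
)
print(f"95% bounds for AVG(age): [{samples[25]}, {samples[975]}]")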

Note that the database only gets queried once in this sample. The 1,000 calls with pre_aggregated all happen in memory on the CPU, without accessing the original data. We pass in the query_ast to make them a little faster, but we could just as easily pass in the query string. The simulation takes about a minute on my machine, and we could probably speed that up substantially by skipping the repeated validation steps (since it's the same query, it only needs to be validated once).
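
For instance, the loop body above could be written with the original query string instead of the AST (a hedged variant, assuming execute() accepts pre_aggregated the same way _execute_ast does):

# Same simulation; a bit slower, since the string is parsed and validated on each call
noisy = conn.execute(query, pre_aggregated=inner)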

This approach is most useful for AVG and VAR when there is a GROUP BY with large differences in partition size. For example, if reporting statistics by county or ZIP code, some partitions will be gigantic and others will have very few members, and the errors will be very different per partition. Our preferred way to handle that is to simulate exactly as above, but bin the noisy values by the noisy count of members: for example, collect all noisy AVG and VAR values for partitions with 1-100 members, 100-500, 500-2500, and 2500 and above, then report the average +/- error for each bin size.
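
A rough sketch of that binning step, assuming the per-partition simulation has already been run and flattened into a DataFrame holding, for each (run, partition) pair, the noisy member count, the simulated noisy AVG, and the reported AVG (all names and numbers below are illustrative placeholders, not library output):

import pandas as pd

# Illustrative stand-in for the simulation output described above
sims = pd.DataFrame({
    "noisy_count": [42, 37, 180, 2900, 3100, 95],
    "noisy_avg": [51.0, 44.2, 47.9, 48.1, 48.0, 55.3],
    "reported_avg": [48.0, 48.0, 48.0, 48.0, 48.0, 48.0],
})

# Bin partitions by their noisy member count
bins = [1, 100, 500, 2500, float("inf")]
labels = ["1-100", "100-500", "500-2500", "2500+"]
sims["size_bin"] = pd.cut(sims["noisy_count"], bins=bins, labels=labels)

# Report an error bar per bin, e.g. the 95th percentile of |simulated - reported|
sims["abs_err"] = (sims["noisy_avg"] - sims["reported_avg"]).abs()
print(sims.groupby("size_bin", observed=True)["abs_err"].quantile(0.95))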
