-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Re-adding accuracy estimation to queries #502
Comments
The analytic formula for AVG accuracy ended up not being very useful. We now just simulate to get the accuracies. The Something to note is that you can get the simulated accuracies "for free" if you simulate on-top of the noisy answer, since all future runs count as post-processing. For example, this code example uses only epsilon of 1.0, and gives 95% bounds for AVG(income) and AVG(age): from copy import deepcopy
import pandas as pd
import tqdm
from snsql import Privacy, from_connection
pums = pd.read_csv("../datasets/PUMS.csv")
yaml = "../datasets/PUMS.yaml"
privacy = Privacy(epsilon=1.0, delta=1e-5)
conn = from_connection(pums, metadata=yaml, privacy=privacy)
query = "SELECT AVG(age), AVG(income) FROM PUMS.PUMS"
query_ast = conn.parse_query_string(query)
from_clause = deepcopy(query_ast.source)
subquery_ast, _ = conn._rewrite(query)
subquery_ast.source = from_clause
subquery = str(subquery_ast)
print(f"Original query: {query}")
print(f"Subquery: {subquery}")
inner = conn.execute(subquery)
# inner = conn.reader.execute(subquery)
print(f"Inner result: \n{inner}")
age = []
income = []
n_runs = 1000
for _ in tqdm.trange(n_runs):
noisy = conn._execute_ast(query_ast, pre_aggregated=inner)
age.append(noisy[1][0])
income.append(noisy[1][1])
alpha = 0.05
age = sorted(age)
income = sorted(income)
age_lower = age[int(n_runs * alpha / 2)]
age_upper = age[int(n_runs * (1 - alpha / 2))]
income_lower = income[int(n_runs * alpha / 2)]
income_upper = income[int(n_runs * (1 - alpha / 2))]
income_reported = inner[1][2] / inner[1][0]
age_reported = inner[1][1] / inner[1][0]
print(f"Reported age: {age_reported} will be in [{age_lower}, {age_upper}]")
print(f"Reported income: {income_reported} will be in [{income_lower}, {income_upper}]") This is a more advanced version of the example in the docs, where in the docs the caller needs to know that the AVG is composed of SUM and COUNT, in this example we use the query rewriter to extract that information. In this way, you could pass in any other formulas and still automatically obtain the right subquery needed. The commented-out line Note that the database only gets queried once in this sample. The 1000 calls to pre-aggregated all happen in-memory on the CPU without accessing the original data. We pass in the This approach is best for AVG and VAR when there are |
What would it take to get accuracy estimation back for more complex agg like avg, assuming we're just using laplace and gaussian?
(it's pretty urgent for us)
The text was updated successfully, but these errors were encountered: