Joins don't seem to be working #512
Is there some other way I should be doing this? I tried just doing a correlated subquery with …

The relevant metadata is …
Joins using USING give me "Support for JOIN queries is currently disabled"? The docs seem to imply that joins are usable, though.
Yes, joins are disabled. I will add a section to the docs with more details, and the recommended workaround.

A join typically results in an unbounded one-to-many projection of the private identifier. For example, if we have customers and orders, and the max order size is (e.g.) $500, we can't use a sensitivity of 500, since we don't know the maximum number of orders each customer can have. Adding or removing a single individual from the dataset could impact the sum in an unbounded fashion, since tables typically don't have constraints that limit the number of joined records (e.g. a limit on the number of orders). SmartNoise automatically deals with the unbounded contribution problem by performing reservoir sampling to bound the contribution. So, for example, if you have …

So the recommended workaround is for the data curator to expose the data to SmartNoise in a denormalized format, such as a view that performs the join. To SmartNoise and the analyst, it just looks like a single table where each user identifier can appear more than once. Of course, the metadata needs to match this view's schema, and the data curator needs to choose a suitable value of …

It's possible for SmartNoise to do that process under the covers: we can take a SQL query with a JOIN statement, expand it to the full tabular output representation, merge in the sensitivities, and do reservoir sampling. The code is actually all there, just disabled. A primary challenge was that the step to bound unbounded user contribution was opaque to the data curator and analyst, so they could be getting unexpected results without understanding why. Requiring the metadata to match a denormalized view makes things clearer for both parties. That's not an insurmountable problem; I'm just explaining the rationale.

The other challenge is that support for free-form, analyst-specified joins required a lot more validation to ensure that malicious analysts couldn't breach privacy. There are various tricks with subqueries and self-joins that can be used to try to single out or explode a user's contribution. We have some validation code to catch such cases, but have low confidence in being able to detect every case.
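A minimal sketch of the denormalized-view workaround described above, using the customers/orders example from this comment. The view name, column names, and bounds are illustrative assumptions, not taken from the project; the point is that the data curator performs the join once, and the analyst's query contains no JOIN.

```sql
-- Hypothetical denormalized view the data curator would expose to SmartNoise.
-- Each customer_id (the private identifier) can now appear once per order.
CREATE VIEW public.customer_orders AS
SELECT c.customer_id,   -- private identifier; may repeat across rows
       o.order_id,
       o.order_total    -- per-row bound (e.g. 0..500) declared in the metadata
FROM public.customers AS c
JOIN public.orders AS o
  ON o.customer_id = c.customer_id;

-- The analyst then queries the view as an ordinary single table, with no JOIN:
SELECT COUNT(*) FROM public.customer_orders;
```

With the metadata describing customer_orders rather than the base tables, SmartNoise's reservoir sampling can bound how many rows each customer_id contributes, which is what makes a finite sensitivity possible.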
I'm trying to join between two tables that both have the same private_id column set, and trying to join on that private_id, using
select count(*) from public.drug_era join public.person on drug_era.person_id = person.person_id
, and the output I'm getting is "We only know how to handle tables with one key: [<snsql._ast.ast.TableColumn object at 0x7feb93757f70>, <snsql._ast.ast.TableColumn object at 0x7feb93721a00>]".