Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Joins don't seem to be working #512

Open
FishmanL opened this issue Nov 30, 2022 · 4 comments
Open

Joins don't seem to be working #512

FishmanL opened this issue Nov 30, 2022 · 4 comments

Comments

@FishmanL
Copy link
Contributor

I'm trying to join between two tables that both have the same private_id column setm and trying to join on that private_id, using
select count(*) from public.drug_era join public.person on drug_era.person_id = person.person_id, and the output I'm getting is "We only know how to handle tables with one key: [<snsql._ast.ast.TableColumn object at 0x7feb93757f70>, <snsql._ast.ast.TableColumn object at 0x7feb93721a00>]"

@FishmanL
Copy link
Contributor Author

is there some other way I should be doing this? I tried just doing a correlated subquery with select count(*) from public.drug_era where person_id in (select person_id from public.person) but that didnt work either

@FishmanL
Copy link
Contributor Author

FishmanL commented Nov 30, 2022

the relevant metadata is

DRUG_ERA:
      drug_concept_id:
        type: int
      drug_era_end_date:
        type: datetime
      drug_era_id:
        type: int
      drug_era_start_date:
        type: datetime
      drug_exposure_count:
        lower: 1
        type: int
        upper: 101
      gap_days:
        lower: 0
        type: int
        upper: 125
      person_id:
        private_id: true
        type: int
 PERSON:
      birth_datetime:
        type: datetime
      care_site_id:
        type: int
      day_of_birth:
        lower: 1
        type: int
        upper: 31
      ethnicity_concept_id:
        type: int
      ethnicity_source_concept_id:
        type: int
      ethnicity_source_value:
        type: string
      gender_concept_id:
        lower: 8507
        type: int
        upper: 8532
      gender_source_concept_id:
        type: int
      gender_source_value:
        type: string
      location_id:
        type: int
      max_ids: 1
      month_of_birth:
        lower: 1
        type: int
        upper: 12
      person_id:
        private_id: true
        type: int
      person_source_value:
        type: string
      provider_id:
        type: int
      race_concept_id:
        type: int
      race_source_concept_id:
        type: int
      race_source_value:
        type: string
      year_of_birth:
        lower: 1908
        type: int
        upper: 2022```

@FishmanL
Copy link
Contributor Author

FishmanL commented Nov 30, 2022

Joins using USING give me "Support for JOIN queries is currently disabled"? the docs seem to imply that joins are usable though.

@joshua-oss
Copy link
Contributor

Yes, joins are disabled. I will add a section to the docs with more details, and the recommended workaround.

A join typically results in an unbounded one-to-may projection of the private identifier. For example, if we have customers and orders, and the max order size is (e.g.) $500, we can't use sensitivity of 500, since we don't know the maximum number of orders each customer can have. Adding or removing a single individual from the dataset could impact the sum in an unbounded fashion, since tables typically don't have constraints that limit the number of joined records (e.g. a limit on number of orders).

SmartNoise automatically deals with the unbounded contribution problem by performing reservoir sampling to bound the contribution. So, for example, if you have max_ids set to 1, and you feed a table which has denormalized customers and orders, SmartNoise will randomly select just a single order per user and drop the rest. Setting max_ids to 5 would allow it to randomly select 5 orders per customer, at the expense of quintupling the sensitivity and required noise scale. Tuning this parameter is domain-specific and requires some expertise from the data curator.

So the recommended workaround is for the data curator to expose the data to SmartNoise in a denormalized format, such as a view that performs the join. To SmartNoise and the analyst, it just looks like a single table where each user identifier can appear more than once. Of course, the metadata needs to match this view's schema, and the data curator needs to choose a suitable value of max_ids that balances between utility and privacy. Then, SmartNoise automatically does the appropriate reservoir sampling and expansion of the sensitivity.

It's possible for SmartNoise to do that process under the covers: we can take a SQL query with a JOIN statement, expand to the full tabular output representation, merge in the sensitivities and do reservoir sampling. The code is actually all there, just disabled. A primary challenge was that the step to bound unbounded user contribution was opaque to the data curator and analyst, so they could be getting unexpected results without understanding why. Requiring the metadata to match a denormalized view makes things more clear for both parties. That's not an insurmountable problem, I'm just explain the rationale. The other challenge is that support for free-form analyst-specified joins required a lot more validation to ensure that malicious analysts couldn't breach privacy. There are various tricks with subqueries and self-joins that can be used to try to single out or explode a user's contribution. We have some validation code to catch such cases, but have low confidence in being able to detect every case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants