add join compatibility based on physical_location to destination configurations#3905
add join compatibility based on physical_location to destination configurations#3905Travior wants to merge 22 commits into
physical_location to destination configurations#3905Conversation
Deploying with
|
| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! View logs |
docs | 328785c | Commit Preview URL Branch Preview URL |
Jun 10 2026, 10:37 AM |
ce0fc0d to
274bcbb
Compare
ca6f0d4 to
a5c8b91
Compare
rudolfix
left a comment
There was a problem hiding this comment.
- I think you wanted to name
physical_destination->physical_location? fingerprintcontract was changed. let's discuss that. impact is potentially pretty big- why sqlalchemy fingerprint test is passing? was it compatible or we do not test it at all?
- we need another ticket to extend #3747 - to allow to ATTACH one duckdb client to another. I can take care of that
- Test setup in
load/pipelineis great. let's reuse it for end to end tests for #3747
| # FilesystemConfiguration.fingerprint() (which hashes the raw bucket URL) | ||
| # over DestinationClientConfiguration.fingerprint() (which hashes | ||
| # physical_destination()). Do not remove. | ||
| return DestinationClientStagingConfiguration.fingerprint(self) |
There was a problem hiding this comment.
please tests this. MRO can change ie. someone may add another base class to it
| can access multiple storage backends in a single query, so join | ||
| compatibility is determined by the engine, not by the storage location. | ||
| """ | ||
| if isinstance(other, FilesystemDestinationClientConfiguration): |
There was a problem hiding this comment.
NOTE: this will require extension of #3747 - ATTACH foreign duckdb with all its views and tables to current duckdb in the dataset and using 3 part qualification when binding queries. should be "easily" doable. not this ticket though!
| return host | ||
| return "" | ||
|
|
||
| def can_join_with(self, other: DestinationClientConfiguration) -> bool: |
There was a problem hiding this comment.
you could eliminate some code below (IMO) via 2 helper classes:
- we can join two physical destinations
- as above but also database name must mach
helper 1 can call helper 2. also helper 1 looks like super().can_join_with() where we just compare physical locations.
|
Automated review brought this (I didn't verify):
|
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
rudolfix
left a comment
There was a problem hiding this comment.
this is really good and close to be merged. I pushed a commit here:
- it splits ability to read data from (select ... join) from ability to write data from (INSERT SELECT) - this will be needed for transformations engine. also not all engines will be SQL based in the future
- I fixed a few obscure data location issues
- filesystem - I decided to use physical location to check joinability. we'll replace when auto attach will be implemented
what remains:
- Final was removed in several places
- bigquery - so it seems it can join based on location. and we already have
location()on the configuration. so we could just renamephysical_locationto it. please confirm - tests are very good but if you want to split this large file I'd move it to where finerprint tests are... not needed now though!
|
|
||
| def physical_location(self) -> str: | ||
| """Returns configured project id, falling back to credentials.""" | ||
| project_id = self.project_id |
There was a problem hiding this comment.
I checked again and bigquery can join based on location. so here just return self.location.
https://docs.cloud.google.com/bigquery/docs/locations#specify_locations
so now it looks like self.location() is just right! and we can adopt this name everywhere like in the original ticket.
There was a problem hiding this comment.
You seem to be right from what I read. The question is do we want to mix the derived location() that will then be overridden with the location attribute only for bigquery? In my mind keeping physical_location() as the universal accessor (even though the wiring for big query is just dead boilerplate) saves a couple headaches.
These are the currently enforced rules, together with the rules that diverge from the original ticket:
Another note: the accessor is called
physical_destination(), as big query already defines a config property calleddestination.Here are the tables formatted in Markdown:
Compatibility Overview vs Spec
can_join_withRuleSQLAlchemy Overview vs Spec
Additional Current Definitions Not In Spec
Still TODO/unclear:
destination.snowflake.join_compatibility_database