Skip to content

add join compatibility based on physical_location to destination configurations#3905

Open
Travior wants to merge 22 commits into
develfrom
feat/join_compat
Open

add join compatibility based on physical_location to destination configurations#3905
Travior wants to merge 22 commits into
develfrom
feat/join_compat

Conversation

@Travior

@Travior Travior commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

These are the currently enforced rules, together with the rules that diverge from the original ticket:
Another note: the accessor is called physical_destination(), as big query already defines a config property called destination.

Here are the tables formatted in Markdown:

Compatibility Overview vs Spec

Destination Current Location / Identity Current can_join_with Rule Divergence From Spec
Postgres host:port Same host:port and same database Matches
Redshift host:port Same host:port and same database Matches
Snowflake account host Same account host Matches
BigQuery project_id Same project ID Matches
MSSQL host:port Same host:port; database may differ Minor: spec says host, implementation includes port
Synapse host:port Same host:port; database may differ Minor: spec says inherited host, implementation includes port
ClickHouse host:port Same host:port; database may differ Minor: spec says host, implementation includes port
Databricks server_hostname Same server hostname Matches
Athena region/catalog Same AWS region and same data catalog Diverges: spec says location is only aws_data_catalog; implementation also requires same AWS region
Dremio host:port Same host:port Minor: spec says host, implementation includes port
DuckDB database file path or :memory: Same database path Matches
MotherDuck physical_destination() is empty; fingerprint/token used separately Same non-empty access token Diverges structurally: spec says location is access token and default same-location rule; implementation intentionally does not expose token as location and overrides can_join_with
Filesystem remote: scheme://netloc; local: "" Always True with another filesystem config Matches
DuckLake credential-free catalog identity + ducklake_name Same catalog identity and same DuckLake name Matches

SQLAlchemy Overview vs Spec

Dialect Current Rule Divergence From Spec
postgresql Same host:port and same database Matches
mysql Same host:port; database may differ Matches
mssql Same host:port; database may differ Matches
oracle Same host:port; database may differ Matches
db2 Same host:port; database may differ Matches
sqlite Same database file path Matches
unknown/other Same host:port and same database Matches

Additional Current Definitions Not In Spec

Destination Current Rule
LanceDB Same LanceDB URI and same dataset_separator; dataset name may differ
Lance Same catalog root and same bound dataset name
Weaviate Never joinable
Qdrant Never joinable

Still TODO/unclear:

  • CI secrets.toml (or config?) needs destination.snowflake.join_compatibility_database
  • integration tests intentionally don't execute explicit joins yet as query APIs don't properly support cross destination/database data access (e.g. snowflake qualified name doesn't include the database / filesystem client doesn't expose option to register "foreign" file as duckdb view)
  • Should we explicitly add more databases to the tests (when destination supports the join like snowflake), if so we need to manually set up different more databases / catalogs for those providers

@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Apr 29, 2026

Copy link
Copy Markdown

Deploying with  Cloudflare Workers  Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status Name Latest Commit Preview URL Updated (UTC)
✅ Deployment successful!
View logs
docs 328785c Commit Preview URL

Branch Preview URL
Jun 10 2026, 10:37 AM

@Travior Travior self-assigned this Apr 29, 2026
@Travior Travior force-pushed the feat/join_compat branch 3 times, most recently from ce0fc0d to 274bcbb Compare April 29, 2026 13:53
@burnash burnash linked an issue May 11, 2026 that may be closed by this pull request
@Travior Travior force-pushed the feat/join_compat branch 2 times, most recently from ca6f0d4 to a5c8b91 Compare May 19, 2026 10:54
@Travior Travior marked this pull request as ready for review May 19, 2026 12:57
@Travior Travior requested a review from rudolfix May 19, 2026 12:58

@rudolfix rudolfix left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I think you wanted to name physical_destination -> physical_location?
  2. fingerprint contract was changed. let's discuss that. impact is potentially pretty big
  3. why sqlalchemy fingerprint test is passing? was it compatible or we do not test it at all?
  4. we need another ticket to extend #3747 - to allow to ATTACH one duckdb client to another. I can take care of that
  5. Test setup in load/pipeline is great. let's reuse it for end to end tests for #3747

Comment thread dlt/dataset/dataset.py
Comment thread dlt/destinations/impl/duckdb/configuration.py
# FilesystemConfiguration.fingerprint() (which hashes the raw bucket URL)
# over DestinationClientConfiguration.fingerprint() (which hashes
# physical_destination()). Do not remove.
return DestinationClientStagingConfiguration.fingerprint(self)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please tests this. MRO can change ie. someone may add another base class to it

Comment thread dlt/destinations/impl/filesystem/configuration.py Outdated
can access multiple storage backends in a single query, so join
compatibility is determined by the engine, not by the storage location.
"""
if isinstance(other, FilesystemDestinationClientConfiguration):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: this will require extension of #3747 - ATTACH foreign duckdb with all its views and tables to current duckdb in the dataset and using 3 part qualification when binding queries. should be "easily" doable. not this ticket though!

Comment thread dlt/destinations/impl/motherduck/configuration.py
return host
return ""

def can_join_with(self, other: DestinationClientConfiguration) -> bool:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could eliminate some code below (IMO) via 2 helper classes:

  • we can join two physical destinations
  • as above but also database name must mach

helper 1 can call helper 2. also helper 1 looks like super().can_join_with() where we just compare physical locations.

Comment thread dlt/common/destination/client.py Outdated
Comment thread dlt/common/destination/client.py
Comment thread tests/load/pipeline/test_join_compatibility.py
@rudolfix

rudolfix commented May 26, 2026

Copy link
Copy Markdown
Collaborator

Automated review brought this (I didn't verify):

Duplicate parametrize cases (tests/destinations/test_join_compatibility.py:407-420): fabric_port and fabric_default_port are byte-identical — same factory, same expected value. One
should set an explicit port (e.g., credentials.port = 1433) to actually distinguish "default port" from "explicit port".

@Travior Travior force-pushed the feat/join_compat branch from 6346996 to 3a5154d Compare June 3, 2026 11:43
@blacksmith-sh

This comment has been minimized.

@Travior Travior force-pushed the feat/join_compat branch from 12186ec to 2a6cbf6 Compare June 3, 2026 12:16
@blacksmith-sh

This comment has been minimized.

@rudolfix rudolfix left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is really good and close to be merged. I pushed a commit here:

  • it splits ability to read data from (select ... join) from ability to write data from (INSERT SELECT) - this will be needed for transformations engine. also not all engines will be SQL based in the future
  • I fixed a few obscure data location issues
  • filesystem - I decided to use physical location to check joinability. we'll replace when auto attach will be implemented

what remains:

  • Final was removed in several places
  • bigquery - so it seems it can join based on location. and we already have location() on the configuration. so we could just rename physical_location to it. please confirm
  • tests are very good but if you want to split this large file I'd move it to where finerprint tests are... not needed now though!

Comment thread dlt/dataset/dataset.py Outdated
Comment thread dlt/destinations/impl/databricks/configuration.py

def physical_location(self) -> str:
"""Returns configured project id, falling back to credentials."""
project_id = self.project_id

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked again and bigquery can join based on location. so here just return self.location.
https://docs.cloud.google.com/bigquery/docs/locations#specify_locations

so now it looks like self.location() is just right! and we can adopt this name everywhere like in the original ticket.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You seem to be right from what I read. The question is do we want to mix the derived location() that will then be overridden with the location attribute only for bigquery? In my mind keeping physical_location() as the universal accessor (even though the wiring for big query is just dead boilerplate) saves a couple headaches.

Comment thread dlt/destinations/impl/databricks/configuration.py Outdated
Comment thread tests/destinations/test_join_compatibility.py
Comment thread tests/destinations/test_join_compatibility.py Outdated
Comment thread tests/destinations/test_join_compatibility.py Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: implement join compatibility check for destinations

2 participants