Background
We need to determine whether two dlt.Dataset instances can be joined in a single SQL query. The current is_same_physical_destination (dlt/dataset/dataset.py:426) compares str(config) — which calls str(credentials), rendering the full connection URL including username, port, and database:
def is_same_physical_destination(dataset1, dataset2):
return str(dataset1.destination_client.config) == str(dataset2.destination_client.config)
This is too strict for destinations that support cross-database joins (Snowflake, BigQuery, MSSQL, ClickHouse, Databricks, Dremio, Athena) — they return False even when tables can be joined across databases on the same server/account.
It is semantically wrong for the filesystem destination where joinability comes from the in-process DuckDB query engine, not data location.
A fingerprint() method exists on DestinationClientConfiguration that hashes a location identifier (e.g. host), but it is not used by is_same_physical_destination and is missing on several destinations (DuckDB, Athena, SQLAlchemy, Fabric).
What to implement
1. Add location() to DestinationClientConfiguration
Returns the unhashed physical location string — the identity of where this destination lives. Replaces per-destination fingerprint() overrides. fingerprint() becomes derived on the base class and is no longer overridden per-destination:
def location(self) -> str:
"""Returns the physical location of this destination (e.g. host, bucket URL, project id)"""
return ""
def fingerprint(self) -> str:
"""Returns a hash of location()"""
loc = self.location()
return digest128(loc) if loc else ""
2. Add can_join_with(other) to DestinationClientConfiguration
Determines if tables from this destination can be joined with tables from other in a single query. This reflects the dlt destination's current capability, not the theoretical capability of the underlying database engine (e.g. DuckDB supports ATTACH for cross-database joins, but dlt doesn't implement it yet).
Default implementation compares locations. Destinations override for special semantics (e.g. filesystem, Postgres).
def can_join_with(self, other: "DestinationClientConfiguration") -> bool:
"""Check if tables from this destination can be joined with tables from other in a single query"""
self_loc = self.location()
other_loc = other.location()
if not self_loc or not other_loc:
return False
return self_loc == other_loc
3. Update is_same_physical_destination in dlt/dataset/dataset.py
Delegate to can_join_with on the config.
Join compatibility per destination
location() is the physical identity of the destination. can_join_with determines joinability — it defaults to comparing locations but can be overridden.
| Destination |
location() |
can_join_with rule |
Notes |
| Postgres |
host:port |
same host:port/database |
No cross-DB joins, override needed |
| Redshift |
host:port |
same host:port/database |
No cross-DB joins, override needed |
| Snowflake |
host (account identifier) |
same location (default) |
Cross-DB within account |
| BigQuery |
project_id |
same location (default) |
Cross-dataset within project |
| MSSQL |
host |
same location (default) |
Cross-DB via db.schema.table |
| Synapse |
host (inherited from MSSQL) |
same location (default) |
Cross-DB via db.schema.table |
| ClickHouse |
host |
same location (default) |
Cross-DB via db.table |
| Databricks |
server_hostname |
same location (default) |
Cross-catalog via Unity Catalog |
| Athena |
aws_data_catalog |
same location (default) |
Cross-DB within Glue catalog |
| Dremio |
host |
same location (default) |
Federation engine, cross-source |
| DuckDB |
database file path or ":memory:" |
same location (default) |
No ATTACH support in dlt currently |
| MotherDuck |
access token (account) |
same location (default) |
Cross-DB within account |
| Filesystem |
scheme://netloc or "" for local |
always True vs other filesystem, override needed |
In-process DuckDB scans files directly |
| DuckLake |
catalog connection string (no credentials) + ducklake_name |
same location (default) |
Catalog is the identity, not storage (paths are relative, bucket is swappable) |
SQLAlchemy — dialect-dependent can_join_with
SQLAlchemy's location() is host:port (or file path for sqlite). can_join_with depends on the backend dialect:
| Backend |
can_join_with rule |
postgresql |
same host:port/database (no cross-DB) |
mysql |
same host:port (cross-DB supported) |
mssql |
same host:port (cross-DB supported) |
oracle |
same host:port (cross-DB supported) |
db2 |
same host:port (cross-DB supported) |
sqlite |
same database file path |
| unknown |
same host:port/database (conservative default) |
Missing location() implementations
These destinations currently have no fingerprint() override and return "":
| Destination |
Needed location() |
| DuckDB |
database file path or ":memory:" |
| Athena |
aws_data_catalog identifier |
| SQLAlchemy |
host:port (or file path for sqlite) |
| Fabric |
host/workspace identifier from credentials |
Existing corrections needed
| Destination |
Current fingerprint() |
Problem |
Fix |
| DuckLake |
digest128(storage bucket) |
Storage is swappable via OVERRIDE_DATA_PATH; catalog is the real identity |
location() = catalog connection string (no credentials) + ducklake_name |
Background
We need to determine whether two
dlt.Datasetinstances can be joined in a single SQL query. The currentis_same_physical_destination(dlt/dataset/dataset.py:426) comparesstr(config)— which callsstr(credentials), rendering the full connection URL including username, port, and database:This is too strict for destinations that support cross-database joins (Snowflake, BigQuery, MSSQL, ClickHouse, Databricks, Dremio, Athena) — they return
Falseeven when tables can be joined across databases on the same server/account.It is semantically wrong for the filesystem destination where joinability comes from the in-process DuckDB query engine, not data location.
A
fingerprint()method exists onDestinationClientConfigurationthat hashes a location identifier (e.g. host), but it is not used byis_same_physical_destinationand is missing on several destinations (DuckDB, Athena, SQLAlchemy, Fabric).What to implement
1. Add
location()toDestinationClientConfigurationReturns the unhashed physical location string — the identity of where this destination lives. Replaces per-destination
fingerprint()overrides.fingerprint()becomes derived on the base class and is no longer overridden per-destination:2. Add
can_join_with(other)toDestinationClientConfigurationDetermines if tables from this destination can be joined with tables from
otherin a single query. This reflects the dlt destination's current capability, not the theoretical capability of the underlying database engine (e.g. DuckDB supports ATTACH for cross-database joins, but dlt doesn't implement it yet).Default implementation compares locations. Destinations override for special semantics (e.g. filesystem, Postgres).
3. Update
is_same_physical_destinationindlt/dataset/dataset.pyDelegate to
can_join_withon the config.Join compatibility per destination
location()is the physical identity of the destination.can_join_withdetermines joinability — it defaults to comparing locations but can be overridden.location()can_join_withrulehost:porthost:port/databasehost:porthost:port/databasehost(account identifier)project_idhostdb.schema.tablehost(inherited from MSSQL)db.schema.tablehostdb.tableserver_hostnameaws_data_cataloghost":memory:"scheme://netlocor""for localTruevs other filesystem, override neededducklake_nameSQLAlchemy — dialect-dependent
can_join_withSQLAlchemy's
location()ishost:port(or file path for sqlite).can_join_withdepends on the backend dialect:can_join_withrulepostgresqlhost:port/database(no cross-DB)mysqlhost:port(cross-DB supported)mssqlhost:port(cross-DB supported)oraclehost:port(cross-DB supported)db2host:port(cross-DB supported)sqlitehost:port/database(conservative default)Missing
location()implementationsThese destinations currently have no
fingerprint()override and return"":location()":memory:"aws_data_catalogidentifierhost:port(or file path for sqlite)Existing corrections needed
fingerprint()digest128(storage bucket)OVERRIDE_DATA_PATH; catalog is the real identitylocation()= catalog connection string (no credentials) +ducklake_name