Skip to content

feat: implement join compatibility check for destinations #3758

@rudolfix

Description

@rudolfix

Background

We need to determine whether two dlt.Dataset instances can be joined in a single SQL query. The current is_same_physical_destination (dlt/dataset/dataset.py:426) compares str(config) — which calls str(credentials), rendering the full connection URL including username, port, and database:

def is_same_physical_destination(dataset1, dataset2):
    return str(dataset1.destination_client.config) == str(dataset2.destination_client.config)

This is too strict for destinations that support cross-database joins (Snowflake, BigQuery, MSSQL, ClickHouse, Databricks, Dremio, Athena) — they return False even when tables can be joined across databases on the same server/account.

It is semantically wrong for the filesystem destination where joinability comes from the in-process DuckDB query engine, not data location.

A fingerprint() method exists on DestinationClientConfiguration that hashes a location identifier (e.g. host), but it is not used by is_same_physical_destination and is missing on several destinations (DuckDB, Athena, SQLAlchemy, Fabric).

What to implement

1. Add location() to DestinationClientConfiguration

Returns the unhashed physical location string — the identity of where this destination lives. Replaces per-destination fingerprint() overrides. fingerprint() becomes derived on the base class and is no longer overridden per-destination:

def location(self) -> str:
    """Returns the physical location of this destination (e.g. host, bucket URL, project id)"""
    return ""

def fingerprint(self) -> str:
    """Returns a hash of location()"""
    loc = self.location()
    return digest128(loc) if loc else ""

2. Add can_join_with(other) to DestinationClientConfiguration

Determines if tables from this destination can be joined with tables from other in a single query. This reflects the dlt destination's current capability, not the theoretical capability of the underlying database engine (e.g. DuckDB supports ATTACH for cross-database joins, but dlt doesn't implement it yet).

Default implementation compares locations. Destinations override for special semantics (e.g. filesystem, Postgres).

def can_join_with(self, other: "DestinationClientConfiguration") -> bool:
    """Check if tables from this destination can be joined with tables from other in a single query"""
    self_loc = self.location()
    other_loc = other.location()
    if not self_loc or not other_loc:
        return False
    return self_loc == other_loc

3. Update is_same_physical_destination in dlt/dataset/dataset.py

Delegate to can_join_with on the config.

Join compatibility per destination

location() is the physical identity of the destination. can_join_with determines joinability — it defaults to comparing locations but can be overridden.

Destination location() can_join_with rule Notes
Postgres host:port same host:port/database No cross-DB joins, override needed
Redshift host:port same host:port/database No cross-DB joins, override needed
Snowflake host (account identifier) same location (default) Cross-DB within account
BigQuery project_id same location (default) Cross-dataset within project
MSSQL host same location (default) Cross-DB via db.schema.table
Synapse host (inherited from MSSQL) same location (default) Cross-DB via db.schema.table
ClickHouse host same location (default) Cross-DB via db.table
Databricks server_hostname same location (default) Cross-catalog via Unity Catalog
Athena aws_data_catalog same location (default) Cross-DB within Glue catalog
Dremio host same location (default) Federation engine, cross-source
DuckDB database file path or ":memory:" same location (default) No ATTACH support in dlt currently
MotherDuck access token (account) same location (default) Cross-DB within account
Filesystem scheme://netloc or "" for local always True vs other filesystem, override needed In-process DuckDB scans files directly
DuckLake catalog connection string (no credentials) + ducklake_name same location (default) Catalog is the identity, not storage (paths are relative, bucket is swappable)

SQLAlchemy — dialect-dependent can_join_with

SQLAlchemy's location() is host:port (or file path for sqlite). can_join_with depends on the backend dialect:

Backend can_join_with rule
postgresql same host:port/database (no cross-DB)
mysql same host:port (cross-DB supported)
mssql same host:port (cross-DB supported)
oracle same host:port (cross-DB supported)
db2 same host:port (cross-DB supported)
sqlite same database file path
unknown same host:port/database (conservative default)

Missing location() implementations

These destinations currently have no fingerprint() override and return "":

Destination Needed location()
DuckDB database file path or ":memory:"
Athena aws_data_catalog identifier
SQLAlchemy host:port (or file path for sqlite)
Fabric host/workspace identifier from credentials

Existing corrections needed

Destination Current fingerprint() Problem Fix
DuckLake digest128(storage bucket) Storage is swappable via OVERRIDE_DATA_PATH; catalog is the real identity location() = catalog connection string (no credentials) + ducklake_name

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

Status
In Progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions