Don't require setting data_path for existing DuckLake catalogs #3796

@kinghuang

Feature description

dlt's DuckLake implementation seems to require that the storage bucket_url be set and match the data_path in the catalog unless override_data_path is used (#3703). This is counterintuitive and makes it really hard to use an existing DuckLake catalog that is managed by another system where I don't know what the data_path option is set to ahead of time.

The data_path parameter is only required if a new catalog is being created. Otherwise, the data_path set in the catalog's options is used, unless override_data_path is specified. dlt's DuckLake destination should not require that data_path be set, nor require that it match the value in the catalog options.
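To make the expected precedence concrete, here is a minimal sketch of the rule described above. The function name `effective_data_path` and its parameters are hypothetical and are not part of dlt or DuckLake; the sketch only models which path should win, assuming `catalog_data_path` is `None` when the catalog does not exist yet.

```python
from typing import Optional


def effective_data_path(
    catalog_data_path: Optional[str],
    configured_data_path: Optional[str],
    override_data_path: bool = False,
) -> str:
    """Return the data path that should be used for reads and writes.

    Hypothetical helper that only models the precedence described in this
    issue; it does not talk to DuckDB or DuckLake.
    """
    if catalog_data_path is None:
        # New catalog: there is no stored data_path yet, so one must be given.
        if configured_data_path is None:
            raise ValueError("data_path is required when creating a new catalog")
        return configured_data_path

    if override_data_path and configured_data_path is not None:
        # Existing catalog, explicit override requested by the user.
        return configured_data_path

    # Existing catalog: the stored data_path wins, and the configured value
    # (if any) is neither required nor compared against it.
    return catalog_data_path


if __name__ == "__main__":
    stored = "s3://dklk-cam-king-cam-main--usw2-az1--x-s3/king_cam_main/"
    print(effective_data_path(stored, None))
    print(effective_data_path(stored, "s3://example/", override_data_path=True))
```

Under this rule, both failing cases reported below would resolve to the catalog's stored path instead of raising a mismatch error.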

Docs: Connecting — DuckLake

Error when setting a dummy bucket_url.

<class 'dlt.destinations.exceptions.DestinationConnectionError'>
Connection with client_type=DuckLakeSqlClient to dataset_name=ghgsat_data failed. Please check if you configured the credentials at all and provided the right credentials values. You can be also denied access or your internet connection may be down. The actual reason given is: DATA_PATH parameter "s3://example/" does not match existing data path in the catalog "s3://dklk-cam-king-cam-main--usw2-az1--x-s3/king_cam_main/".
You can override the DATA_PATH by setting OVERRIDE_DATA_PATH to True.

Error when bucket_url is not set.

<class 'dlt.destinations.exceptions.DestinationConnectionError'>
Connection with client_type=DuckLakeSqlClient to dataset_name=ghgsat_data failed. Please check if you configured the credentials at all and provided the right credentials values. You can be also denied access or your internet connection may be down. The actual reason given is: DATA_PATH parameter "/Users/king/Repositories/github.com/SensorUp/-sdp-defs/sdp-ghgsat-defs/king_cam_main.files/" does not match existing data path in the catalog "s3://dklk-cam-king-cam-main--usw2-az1--x-s3/king_cam_main/".
You can override the DATA_PATH by setting OVERRIDE_DATA_PATH to True.

Are you a dlt user?

Yes, I'm already a dlt user.

Use case

I want to use an existing DuckLake catalog as a target, where I don't know what the catalog's data_path option is set to. I should not have to set it, nor should dlt check its value.

Proposed solution

Don't pass a data_path argument during attach if it's not set, and don't check whether it's the same value as the catalog's data_path. The argument is only pertinent if a new catalog is being created or if override_data_path is used.

Related issues

No response
