BigQuery destination fails to use GCS as staging filesystem #2384

Open
trymzet opened this issue Mar 6, 2025 · 1 comment

trymzet (Contributor) commented Mar 6, 2025

dlt version

1.5.0

Describe the problem

The BigQuery destination fails when I try to use the same filesystem staging config that works with ClickHouse as the destination, i.e. GCS in S3 compatibility mode (a setting the ClickHouse destination requires). I get this error:

google.api_core.exceptions.BadRequest: 400 POST https://bigquery.googleapis.com/bigquery/v2/projects/<my-project>/jobs?prettyPrint=false: Source URI must be a Google Cloud Storage location: s3://<my_bucket>/<my_path>/<my_file>.parquet

It seems this should work, according to #1272.

Expected behavior

I should be able to reuse the same staging config with any destination.

Steps to reproduce

# config.toml
[destination.filesystem]
bucket_url = "s3://<my-bucket>/<my-prefix>"

# my_pipeline.py
# (gcp_credentials and gcs_credentials are dicts loaded elsewhere;
#  gcs_credentials["endpoint_url"] is set to "https://storage.googleapis.com")
import dlt
from dlt.sources.credentials import AwsCredentials, GcpServiceAccountCredentials

pipe = dlt.pipeline(
    destination=dlt.destinations.bigquery(
        credentials=GcpServiceAccountCredentials(**gcp_credentials)
    ),
    staging=dlt.destinations.filesystem(
        credentials=AwsCredentials(
            aws_access_key_id=gcs_credentials["aws_access_key_id"],
            aws_secret_access_key=gcs_credentials["aws_secret_access_key"],
            endpoint_url=gcs_credentials["endpoint_url"],
        )
    ),
)
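
For reference, the endpoint_url mentioned above can also live in configuration instead of code. A minimal sketch, assuming dlt's standard filesystem credential keys for S3-compatible storage (values are placeholders):

# secrets.toml
[destination.filesystem.credentials]
aws_access_key_id = "<access-key-id>"
aws_secret_access_key = "<secret-access-key>"
endpoint_url = "https://storage.googleapis.com"  # GCS in S3 compatibility mode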

Operating system

Linux

Runtime environment

Local

Python version

3.12

dlt data source

No response

dlt destination

Google BigQuery

Other deployment details

No response

Additional information

Workaround:

  1. Change the filesystem protocol from s3 to gs:
# config.toml
[destination.filesystem]
bucket_url = "gs://<my-bucket>/<my-prefix>"
  2. Use GcpServiceAccountCredentials instead of AwsCredentials in the filesystem destination:
# my_pipeline.py
# gcp_credentials is a subset of the service account JSON credentials produced by Google Cloud,
# namely `project_id`, `private_key_id`, `private_key`, and `client_email`
import dlt
from dlt.sources.credentials import GcpServiceAccountCredentials

pipe = dlt.pipeline(
    destination=dlt.destinations.bigquery(
        credentials=GcpServiceAccountCredentials(**gcp_credentials)
    ),
    staging=dlt.destinations.filesystem(
        credentials=GcpServiceAccountCredentials(**gcp_credentials)
    ),
)
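
The same workaround can also be expressed purely in configuration instead of building the credentials in code. A sketch, assuming dlt's standard GCP service-account credential keys (values are placeholders):

# config.toml
[destination.filesystem]
bucket_url = "gs://<my-bucket>/<my-prefix>"

# secrets.toml
[destination.filesystem.credentials]
project_id = "<project-id>"
private_key = "<private-key>"
client_email = "<service-account-email>"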
rudolfix self-assigned this Mar 9, 2025
rudolfix added the question (Further information is requested) label Mar 9, 2025

rudolfix (Collaborator) commented Mar 9, 2025

@trymzet AFAIK it is not possible to pass an s3:// location to a BigQuery load job. And yes, you can share the same staging configuration via a less specific config section (exactly as you do), but it has to be the same config; in your case you have two separate locations on two separate buckets.

I probably do not understand what you want to achieve here :)
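
For context, a minimal sketch of the kind of load job that ends up being issued against BigQuery (placeholder names, using the google-cloud-bigquery client, not dlt's internal code); the source URI must use the gs:// scheme, which is why an s3:// staging path is rejected with the 400 error above:

# bigquery_load_sketch.py - illustration only
from google.cloud import bigquery

client = bigquery.Client(project="<my-project>")
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

# Load jobs only accept Google Cloud Storage URIs (gs://...);
# an s3:// URI fails with "Source URI must be a Google Cloud Storage location".
load_job = client.load_table_from_uri(
    "gs://<my-bucket>/<my-path>/<my-file>.parquet",
    "<my-project>.<my-dataset>.<my-table>",
    job_config=job_config,
)
load_job.result()  # wait for the job to finish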

rudolfix moved this from Todo to In Progress in dlt core library Mar 10, 2025