BigQuery destination fails to use GCS as staging filesystem #2384

Open
trymzet opened this issue Mar 6, 2025 · 1 comment

trymzet (Contributor) commented Mar 6, 2025

dlt version

1.5.0

Describe the problem

The BigQuery destination fails when I try to use the same filesystem staging config that works with ClickHouse as the destination, i.e. GCS in S3 compatibility mode (a setting the ClickHouse destination requires). I get this error:

google.api_core.exceptions.BadRequest: 400 POST https://bigquery.googleapis.com/bigquery/v2/projects/<my-project>/jobs?prettyPrint=false: Source URI must be a Google Cloud Storage location: s3://<my_bucket>/<my_path>/<my_file>.parquet

It seems this should work, according to #1272.

Expected behavior

I should be able to reuse the same staging config with any destination.

Steps to reproduce

# config.toml
[destination.filesystem]
bucket_url = "s3://<my-bucket>/<my-prefix>"

# my_pipeline.py
# (gcp_credentials and gcs_credentials are dicts loaded elsewhere;
#  gcs_credentials["endpoint_url"] is set to "https://storage.googleapis.com")
import dlt
from dlt.sources.credentials import AwsCredentials, GcpServiceAccountCredentials

pipe = dlt.pipeline(
    destination=dlt.destinations.bigquery(
        credentials=GcpServiceAccountCredentials(**gcp_credentials)
    ),
    staging=dlt.destinations.filesystem(
        credentials=AwsCredentials(
            aws_access_key_id=gcs_credentials["aws_access_key_id"],
            aws_secret_access_key=gcs_credentials["aws_secret_access_key"],
            endpoint_url=gcs_credentials["endpoint_url"],
        )
    ),
)
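
For reference, the endpoint_url mentioned above can also live in configuration instead of code. A minimal sketch, assuming dlt's standard filesystem credential keys for S3-compatible storage (values are placeholders):

# secrets.toml
[destination.filesystem.credentials]
aws_access_key_id = "<access-key-id>"
aws_secret_access_key = "<secret-access-key>"
endpoint_url = "https://storage.googleapis.com"  # GCS in S3 compatibility mode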

Operating system

Linux

Runtime environment

Local

Python version

3.12

dlt data source

No response

dlt destination

Google BigQuery

Other deployment details

No response

Additional information

Workaround:

  1. Change the filesystem protocol from s3 to gs:
# config.toml
[destination.filesystem]
bucket_url = "gs://<my-bucket>/<my-prefix>"
  2. Use GcpServiceAccountCredentials instead of AwsCredentials in the filesystem destination:
# my_pipeline.py
# gcp_credentials is a subset of the service account JSON credentials produced by Google Cloud,
# namely `project_id`, `private_key_id`, `private_key`, and `client_email`
import dlt
from dlt.sources.credentials import GcpServiceAccountCredentials

pipe = dlt.pipeline(
    destination=dlt.destinations.bigquery(
        credentials=GcpServiceAccountCredentials(**gcp_credentials)
    ),
    staging=dlt.destinations.filesystem(
        credentials=GcpServiceAccountCredentials(**gcp_credentials)
    ),
)
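
The same workaround can also be expressed purely in configuration instead of building the credentials in code. A sketch, assuming dlt's standard GCP service-account credential keys (values are placeholders):

# config.toml
[destination.filesystem]
bucket_url = "gs://<my-bucket>/<my-prefix>"

# secrets.toml
[destination.filesystem.credentials]
project_id = "<project-id>"
private_key = "<private-key>"
client_email = "<service-account-email>"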
rudolfix self-assigned this Mar 9, 2025
rudolfix added the question (Further information is requested) label Mar 9, 2025

rudolfix (Collaborator) commented Mar 9, 2025

@trymzet AFAIK it is not possible to pass an s3:// location to a BigQuery load job. And yes, you can share the same staging configuration via a less specific config section (exactly as you do), but it has to be the same config; in your case you have two separate locations on two separate buckets.

I probably do not understand what you want to achieve here :)
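
For context, a minimal sketch of the kind of load job that ends up being issued against BigQuery (placeholder names, using the google-cloud-bigquery client, not dlt's internal code); the source URI must use the gs:// scheme, which is why an s3:// staging path is rejected with the 400 error above:

# bigquery_load_sketch.py - illustration only
from google.cloud import bigquery

client = bigquery.Client(project="<my-project>")
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

# Load jobs only accept Google Cloud Storage URIs (gs://...);
# an s3:// URI fails with "Source URI must be a Google Cloud Storage location".
load_job = client.load_table_from_uri(
    "gs://<my-bucket>/<my-path>/<my-file>.parquet",
    "<my-project>.<my-dataset>.<my-table>",
    job_config=job_config,
)
load_job.result()  # wait for the job to finish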

rudolfix moved this from Todo to In Progress in dlt core library Mar 10, 2025