
"dlt pipeline drop" modifies pipeline state of unrelated resources #2408

Open
nsnider-fabric opened this issue Mar 14, 2025 · 1 comment
Labels: bug (Something isn't working)
dlt version

dlt 1.8.1

Describe the problem

When running dlt pipeline drop against a resource, I'm seeing that the last_value and initial_value keys of unrelated resources in the state file are being modified.

Specifically, values that used to have a timezone offset have the offset stripped in the new state that is saved.
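To make "offset stripped" concrete, here is a stdlib-only sketch (plain Python, not dlt code) of the difference between an offset-aware and a naive ISO 8601 timestamp:

```python
from datetime import datetime, timezone

# An offset-aware timestamp, as dlt originally serializes it in state
aware = datetime(2024, 1, 3, tzinfo=timezone.utc)
# The same wall-clock value with the offset stripped, as it appears after the drop
naive = aware.replace(tzinfo=None)

print(aware.isoformat())  # 2024-01-03T00:00:00+00:00
print(naive.isoformat())  # 2024-01-03T00:00:00
```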

Originally reported via slack here - https://dlthub-community.slack.com/archives/C04DQA7JJN6/p1741639831133249

Expected behavior

When dlt pipeline drop is executed against a resource, state is modified ONLY for the resource(s) specified in the CLI command.

Steps to reproduce

  1. Run the following Python script:
import dlt
import pendulum

@dlt.resource(
    incremental=dlt.sources.incremental('updated_at', initial_value=pendulum.parse('2024-01-01T00:00:00Z'))
)
def table_1():
    yield [{'id': 1, 'updated_at': pendulum.parse('2024-01-02T00:00:00Z')}]

@dlt.resource(
    incremental=dlt.sources.incremental('updated_at', initial_value=pendulum.parse('2024-01-01T00:00:00Z'))
)
def table_2():
    yield [{'id': 2, 'updated_at': pendulum.parse('2024-01-03T00:00:00Z')}]


pipeline = dlt.pipeline(
    pipeline_name='test_pipeline',
    dataset_name='public',
    destination='duckdb',
    progress='log'
)

pipeline.run([table_1(), table_2()])
  2. Inspect your local state file and note that the last_value and initial_value timestamps include a timezone offset. It will look something like this:
    "sources": {
        "test": {
            "resources": {
                "table_2": {
                    "incremental": {
                        "updated_at": {
                            "initial_value": "2024-01-01T00:00:00+00:00",
                            "last_value": "2024-01-03T00:00:00+00:00",
                            "unique_hashes": [
                                "v6Hp1MgTA5X5wSy0/XZv"
                            ]
                        }
                    }
                },
                "table_1": {
                    "incremental": {
                        "updated_at": {
                            "initial_value": "2024-01-01T00:00:00+00:00",
                            "last_value": "2024-01-02T00:00:00+00:00",
                            "unique_hashes": [
                                "v2qrZ/O4+7dTvB5kU30n"
                            ]
                        }
                    }
                }
            }
        }
    }
  3. Drop table_1 using the dlt CLI:
    dlt pipeline test_pipeline drop --destination duckdb --dataset public table_1

  4. Look at the new state and note that the initial_value and last_value timestamps for table_2 now have their offsets stripped:

    "sources": {
        "test": {
            "resources": {
                "table_2": {
                    "incremental": {
                        "updated_at": {
                            "initial_value": "2024-01-01T00:00:00",
                            "last_value": "2024-01-03T00:00:00",
                            "unique_hashes": [
                                "v6Hp1MgTA5X5wSy0/XZv"
                            ]
                        }
                    }
                }
            }
        }
    }
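To compare the state before and after the drop programmatically, a small helper like the following can pull out the incremental values. This is a hedged sketch: the state-file path is an assumption based on dlt's default local pipeline layout, and `incremental_values` is a hypothetical helper name, not a dlt API.

```python
import json
from pathlib import Path

# Assumed default location of the local pipeline state for this repro;
# adjust if your pipelines directory is configured elsewhere.
STATE_PATH = Path.home() / ".dlt" / "pipelines" / "test_pipeline" / "state.json"

def incremental_values(state: dict, source: str, resource: str, cursor: str) -> dict:
    """Return the incremental entry (initial_value, last_value, ...) for one resource."""
    return state["sources"][source]["resources"][resource]["incremental"][cursor]

# Example usage against the repro above:
# state = json.loads(STATE_PATH.read_text())
# print(incremental_values(state, "test", "table_2", "updated_at"))
```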

Then, next time you run the pipeline, you'll get a fatal error that looks something like this:

<class 'dlt.extract.incremental.exceptions.IncrementalCursorInvalidCoercion'>
In processing pipe table_2: Could not coerce start_value/initial_value with value 2024-01-03 00:00:00 and type <class 'pendulum.datetime.DateTime'> to actual data item 2024-01-03 00:00:00+00:00 at path updated_at with type DateTime: can't compare offset-naive and offset-aware datetimes. You need to use different data type for start_value/initial_value or cast your data ie. by using `add_map` on this resource.
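The underlying failure is the standard Python rule that naive and aware datetimes cannot be ordered against each other. A minimal stdlib reproduction of the comparison that dlt's incremental logic ends up performing:

```python
from datetime import datetime, timezone

stored_last_value = datetime(2024, 1, 3)              # offset stripped in state
incoming = datetime(2024, 1, 3, tzinfo=timezone.utc)  # data item is offset-aware

try:
    stored_last_value < incoming
except TypeError as exc:
    print(exc)  # can't compare offset-naive and offset-aware datetimes
```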

Operating system

macOS

Runtime environment

Local

Python version

3.11

dlt data source

No response

dlt destination

DuckDB

Other deployment details

No response

Additional information

The issue has been replicated with both the duckdb and redshift destinations.
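Until the bug is fixed, one stopgap suggested by the error message itself is to cast the cursor column back to an offset-aware value via `add_map`. Below is a stdlib-only sketch of such a normalizer; `ensure_utc` is a hypothetical name, and attaching it with `resource.add_map(...)` is the dlt mechanism the error message refers to.

```python
from datetime import datetime, timezone

def ensure_utc(item: dict) -> dict:
    """Hypothetical normalizer: make a naive updated_at offset-aware (UTC)."""
    ts = item.get("updated_at")
    if isinstance(ts, datetime) and ts.tzinfo is None:
        item["updated_at"] = ts.replace(tzinfo=timezone.utc)
    return item

# With dlt, this would be attached to the affected resource, e.g.:
# table_2.add_map(ensure_utc)
```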

@sh-rp sh-rp added the bug Something isn't working label Mar 16, 2025
sh-rp (Collaborator) commented Mar 16, 2025

@nsnider-fabric thanks for opening this and for the nice and simple repro. I think we will have to look at this in the next while.
