
Overwrite with Filter Conditions Example - Large Amount of Filter Conditions #1571

Open
lelandroling opened this issue Jan 24, 2025 · 3 comments

Comments

@lelandroling

lelandroling commented Jan 24, 2025

Question

Checking through the GitHub issues, I noticed very few examples, and I did see the open requests for improved documentation. I understand that I can use MERGE INTO with PySpark. My specific goal is to avoid the large overhead of PySpark, but if that's the solution... ok. Before I walk down that path, though, I'm trying to understand what the use case looks like for .overwrite() and overwrite_filter.

from pyiceberg.expressions import And, Or, EqualTo

conditions = []
for row in values:
    # one AND clause per row: columnA = valueA AND columnB = valueB AND ...
    row_condition = And(*[EqualTo(k, v) for k, v in zip(newKeys, row)])
    conditions.append(row_condition)

# combine every row-level clause into a single OR
filter_condition = Or(*conditions)

I'm using this code to build out the filter_condition, then passing it as overwrite_filter. What I've noticed is that with 1000 records I hit a maximum recursion error. My assumption is that I'm not understanding how to structure the filter_condition, or that the process can't handle this right now and I should move to MERGE INTO and PySpark.

@kevinjqliu
Contributor

I'm using this code to build out the filter_condition, then assigning that to overwrite_filter. What I've noticed is that if I have 1000 records, I'm hitting a maximum recursion error

Thanks for raising this issue! I've heard this mentioned many times before, specifically filters hitting the max recursion error.

Could you include more information on the filter conditions and perhaps a stack trace of the error?

My hypothesis is that this is related to the size of the filter condition, not the size of the underlying data.
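
As a supporting data point (this is an assumption on my part; I haven't traced the failing call), Or(a, b, c, ...) appears to nest conditions pairwise, so roughly 1000 row-level clauses would build an expression tree about 1000 levels deep, which lines up suspiciously well with Python's default recursion limit:

import sys

# Any recursive visit of an expression tree that is ~1000 levels deep will
# exceed this default, matching the ~1000-record threshold reported above.
print(sys.getrecursionlimit())  # typically 1000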

@corleyma

I think the problem here (and in the other mentions I've seen of this) is that folks are attempting to create row-level overwrite filters, which is not what this API is really for. We can and probably should fix the recursion error (most likely by making the code iterative instead of recursive), but it still seems like a smell that people are thinking about this incorrectly. Ultimately the filter conditions should identify which partitions/data files need changes, not which rows.

Using this as a crude replacement for MERGE INTO requires you to understand your data layout well and how Iceberg works in general, so I don't think we should be advising it in the general case.
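
To illustrate the intended granularity, here's a minimal sketch (catalog, table, and column names are made up) where the overwrite filter targets a whole partition rather than individual rows:

from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo

catalog = load_catalog("default")
table = catalog.load_table("db.events")

# new_events_for_the_day is a placeholder: a pyarrow.Table holding the
# replacement rows for the partition being rewritten.
table.overwrite(
    df=new_events_for_the_day,
    overwrite_filter=EqualTo("event_date", "2025-01-24"),
)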

@lelandroling
Author

@kevinjqliu The newKeys variable holds a few keys in my example, so for the sake of the use case:

newKeys = ['columnA', 'columnB', 'columnC']
The values array simply holds each row's values for columnA, columnB, and columnC. These columns are the table's key columns, so I'm effectively building conditions that say: overwrite the row when columnA = rowValueA AND columnB = rowValueB, etc.

@corleyma Fair enough. We can use the MERGE INTO process. I was somewhat married to the idea of getting away from the PySpark dependency, but it does work. Thanks for the answer.
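
In case it helps anyone who lands here later, the MERGE INTO we're moving to looks roughly like this via PySpark (table and column names are placeholders, and the Spark session is assumed to already be configured with the Iceberg catalog extensions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# new_rows_df is a placeholder DataFrame holding the incoming key/value rows.
new_rows_df.createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO db.target t
    USING updates s
    ON t.columnA = s.columnA
   AND t.columnB = s.columnB
   AND t.columnC = s.columnC
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")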
