
Overwrite with Filter Conditions Example - Large Amount of Filter Conditions #1571

Open
lelandroling opened this issue Jan 24, 2025 · 3 comments

Comments

@lelandroling

lelandroling commented Jan 24, 2025

Question

Checking through the GitHub issues, I noticed very few examples, and I did see the open requests for improved documentation. I understand that I can use MERGE INTO with PySpark. My specific goal is to avoid the large overhead of PySpark, but if that's the solution... ok. Before I walk down that path, though, I'm trying to understand what the use case looks like for .overwrite() and overwrite_filter.

from pyiceberg.expressions import And, Or, EqualTo

conditions = []
for row in values:
    # one AND clause per row: columnA = valueA AND columnB = valueB AND ...
    row_condition = And(*[EqualTo(k, v) for k, v in zip(newKeys, row)])
    conditions.append(row_condition)

# combine every row-level clause into a single OR
filter_condition = Or(*conditions)

I'm using this code to build out the filter_condition, then passing it as overwrite_filter. What I've noticed is that with 1000 records I hit a maximum recursion error. My assumption is that I'm not understanding how to structure the filter_condition, or that the process can't handle this right now and I should move to MERGE INTO and PySpark.

@kevinjqliu
Contributor

I'm using this code to build out the filter_condition, then assigning that to overwrite_filter. What I've noticed is that if I have 1000 records, I'm hitting a maximum recursion error

Thanks for raising this issue! I've heard this mentioned many times before, specifically filters hitting the max recursion error.

Could you include more information on the filter conditions and perhaps a stack trace of the error?

My hypothesis is that this is related to the size of the filter condition, not the size of the underlying data.
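
As a supporting data point (this is an assumption on my part; I haven't traced the failing call), Or(a, b, c, ...) appears to nest conditions pairwise, so roughly 1000 row-level clauses would build an expression tree about 1000 levels deep, which lines up suspiciously well with Python's default recursion limit:

import sys

# Any recursive visit of an expression tree that is ~1000 levels deep will
# exceed this default, matching the ~1000-record threshold reported above.
print(sys.getrecursionlimit())  # typically 1000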

@corleyma

I think the problem here (and in the other mentions I've seen of this) is that folks are attempting to create row-level overwrite filters, which is not what this API is really for. We can and probably should fix the recursion error (most likely by making the code iterative instead of recursive), but it still seems like a smell that people are thinking about this incorrectly. Ultimately the filter conditions should identify which partitions/data files need changes, not which rows.

Using this as a crude replacement for MERGE INTO requires you to understand your data layout well and how Iceberg works in general, so I don't think we should be advising it in the general case.
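
To illustrate the intended granularity, here's a minimal sketch (catalog, table, and column names are made up) where the overwrite filter targets a whole partition rather than individual rows:

from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo

catalog = load_catalog("default")
table = catalog.load_table("db.events")

# new_events_for_the_day is a placeholder: a pyarrow.Table holding the
# replacement rows for the partition being rewritten.
table.overwrite(
    df=new_events_for_the_day,
    overwrite_filter=EqualTo("event_date", "2025-01-24"),
)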

@lelandroling
Author

@kevinjqliu The newKeys variable holds a few keys in my example, so for the sake of the use case:

newKeys = ['columnA', 'columnB', 'columnC']
The values array simply holds each row's values for columnA, columnB, and columnC. These columns are the table's key columns, so I'm effectively building conditions that say: overwrite the row when columnA = rowValueA AND columnB = rowValueB, etc.

@corleyma Fair enough. We can use the MERGE INTO process. I was somewhat married to the idea of getting away from the PySpark dependency, but it does work. Thanks for the answer.
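
In case it helps anyone who lands here later, the MERGE INTO we're moving to looks roughly like this via PySpark (table and column names are placeholders, and the Spark session is assumed to already be configured with the Iceberg catalog extensions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# new_rows_df is a placeholder DataFrame holding the incoming key/value rows.
new_rows_df.createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO db.target t
    USING updates s
    ON t.columnA = s.columnA
   AND t.columnB = s.columnB
   AND t.columnC = s.columnC
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")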
