Overwrite with Filter Conditions Example - Large Amount of Filter Conditions #1571
Comments
Thanks for raising this issue! I've heard mentions of this problem many times before, specifically related to filters and the max recursion error. Could you include more information on the filter conditions, and perhaps a stack trace of the error? My hypothesis is that this is related to the size of the filter condition, not the size of the underlying data.
I think the problem here (and in other mentions I've seen of this) is that folks are attempting to create row-level overwrite filters, which is not what this API is really for. We can/should probably fix the recursion error (likely by making the code non-recursive), but it still seems like a smell that people are thinking about this incorrectly. Ultimately, the filter conditions should identify which partitions/data files need changes, not which rows. Using this as a crude replacement for MERGE INTO requires you to understand your data layout and how Iceberg works in general, so I don't think we should advise it in the general case.
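To illustrate the distinction, here is a minimal sketch of a partition-scoped overwrite, assuming a hypothetical `db.events` table partitioned by an `event_date` column (the catalog name, table name, and column are placeholders, and `new_partition_data` is assumed to be a PyArrow table):

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo

# Placeholder catalog/table names, for illustration only.
catalog = load_catalog("default")
table = catalog.load_table("db.events")

# The filter targets a partition column, so it identifies whole data files
# to replace rather than individual rows.
table.overwrite(
    new_partition_data,  # a pyarrow.Table of replacement rows
    overwrite_filter=EqualTo("event_date", "2024-01-01"),
)
```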
@kevinjqliu The newKeys variable is filled with only a few keys in my example; for the sake of the use case, the filter is built from just those keys.
@corleyma Fair enough. We can use the MERGE INTO process. I was somewhat married to the idea of getting away from the PySpark dependency, but it does work. Thanks for the answer.
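For reference, the MERGE INTO path looks roughly like the sketch below. The table name, view name, and join key are placeholders, and it assumes a Spark session that is already configured with the Iceberg extensions and catalog:

```python
from pyspark.sql import SparkSession

# Iceberg catalog and SQL extensions are assumed to be configured elsewhere.
spark = SparkSession.builder.getOrCreate()

# updates_df holds the new/changed rows; register it so SQL can see it.
updates_df.createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO db.events AS t
    USING updates AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```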
Question
Checking through the GitHub issues, I noticed very few examples, though I did see the open requests for improved documentation. I understand that I can use MERGE INTO via PySpark. My specific example is attempting to avoid the large overhead of PySpark, but if that's the solution... ok. Before I walk down that path, I'm trying to understand what the use case looks like for .overwrite() and overwrite_filter.

I'm building out the filter_condition in code and then assigning it to overwrite_filter. What I've noticed is that with 1000 records, I hit a maximum recursion error. My assumption is that either I'm not understanding how to structure the filter_condition, or the process can't handle this right now and I should move to MERGE INTO and PySpark.
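The original snippet isn't shown in the thread, but a construction along the following lines reproduces the shape of the problem: chaining one Or per key produces an expression tree roughly as deep as the number of keys, which is the kind of structure that can hit Python's recursion limit when the expression is later processed. The column name, key values, and `new_rows` are placeholders:

```python
from pyiceberg.expressions import EqualTo, Or

# Hypothetical key list standing in for the ~1000 records described above.
keys = [f"key-{i}" for i in range(1000)]

# Build the filter by OR-ing together one equality predicate per key.
# Each iteration nests the previous expression one level deeper.
filter_condition = EqualTo("id", keys[0])
for key in keys[1:]:
    filter_condition = Or(filter_condition, EqualTo("id", key))

# new_rows is assumed to be a pyarrow.Table of replacement data.
table.overwrite(new_rows, overwrite_filter=filter_condition)
```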