Feature: MERGE/Upsert Support #1534
base: main
Conversation
@kevinjqliu - I'm doing this work on behalf of my company, and when I ran my tests I used a standard Python virtual environment (venv); I haven't quite figured out yet how to get poetry to work inside my company's firewall. So I'm not sure if those are errors I can address or if someone else can pitch in here.
@mattmartin14, what's going on man!? Thanks for working on this and most impressively thanks for the comprehensive description. Out of curiosity, did you discuss your approach with anyone before putting this together? This is good but a few flags for OSS contributions to lower the upfront back and forth:
Contributing to OSS has a different focus than internal code, so hopefully these help. This does look well thought out in terms of implementation, but performance should come in a second or third pass, in favor of having a code history that everyone in the community can wrap their heads around. I'd suggest getting these addressed before Fokko and Kevin scan it. I'll be happy to do a quick glance once the tests are running and there's some consensus around datafusion. PR number one, yeah!
Thanks @bitsondatadev for all this great feedback. I'll get working on your suggestions and push an update next week and will address all your concerns.
Thanks @mattmartin14 for the PR! And thanks @bitsondatadev on the tips on working in OSS. I certainly had to learn a lot of these over the years. A couple things I think we can address first.
This has been a much-anticipated and frequently requested feature in the community. Issue #402 has been tracking it with many eyes on it. I think we still need to figure out the best approach to support this feature. Like you mentioned in the description, as we're building out more and more engine-like features, it becomes harder to support complex and data-intensive workloads such as MERGE INTO. We have been able to use pyarrow for query processing, but it has its own limitations. For more compute-intensive workloads, such as the Bucket and Truncate transforms, we were able to leverage rust (iceberg-rust) to handle the computation. Looking at #402, I don't see any concrete plans on how we can support MERGE INTO. I've added this as an agenda item for the monthly pyiceberg sync and will post the update. Please join us if you have time!
I'm very interested in exploring datafusion and ways we can leverage it for this project. As I mentioned above, we currently use pyarrow to handle most of the compute. It'll be interesting to evaluate datafusion as an alternative. Datafusion has its own ecosystem of expression API, dataframe API, and runtime, all of which are good complements to pyiceberg. It has integrations with the rust side as well, something I have started exploring in apache/iceberg-rust#865. That said, I think we need a wider discussion and alignment on how to integrate with datafusion. It's a good time to start thinking about it! I've added this as another discussion item on the monthly sync.
Compute intensive workloads are generally a bottleneck in python. I am excited for future pyiceberg <> iceberg-rust integration where we can leverage rust to perform those computations.
This is an interesting observation, and I think I've seen someone else run into this issue before. We'd want to address this separately. This is something we might want to explore using datafusion's expression API to replace our own parser.
@kevinjqliu @Fokko @bitsondatadev - the issues should be resolved. I got poetry working inside my company's firewall; I've also removed the dead code and added the license headers to each file. Please take a look.
Also - I added datafusion to the poetry toml file and lock, and it appears that you all need to resolve the conflict here, as it's not letting me.
Also @kevinjqliu - to address your question on datafusion: when I looked into this feature, I explored these 3 options for an arrow processing engine:
I ultimately decided that datafusion would make the most sense, given these things it had going:
Hope this helps explain how I arrived at that conclusion. Just using native pyarrow to try to process the data would be a very large uphill battle, as we would effectively have to build our own data processing engine with it, e.g. hash joins, sorting, optimizations, etc. I figured it does not make sense to reinvent the wheel, and instead we should use an engine that is already out there (datafusion) and put it to good use. I took a look at the attachment you posted for any upcoming pyiceberg sync meetings, but did not see any 2025 meetings listed. I'd be glad to attend to discuss this further, if needed. Thanks,
Thanks for working on this @mattmartin14. There is some work to be done here, mostly because we pull the table into memory, and then perform the operation, which defeats the purpose of Iceberg because we don't use the statistics to optimize the query. I left a few comments
# register both source and target tables so we can find the deltas to update/append
ctx.register_dataset(source_table_name, ds.dataset(df))
ctx.register_dataset(target_table_name, ds.dataset(self.scan().to_arrow()))
This is something we want to avoid. This will materialize the whole table into memory, which does not scale well.
Instead, I think it makes much more sense to re-use the existing delete(delete_filter: BooleanExpression) to delete existing rows. This will only touch the files relevant to the query, and avoid a lot of unnecessary compute and IO.
Hi @Fokko - from our meeting yesterday, I kind of have to load all this into a datafusion table in order to identify the deltas. I'm not sure of any other way around this; if you have some examples, I'd be glad to take a look, but I think what I'm doing is in the best spirit of what a MERGE command should do, considering it has to check for rows matching on the primary key as well as identify which rows are actually different based on the non-key columns.
Sorry, I jumped a bit to conclusions in my previous comment. Allow me to elaborate.
Even with the to_arrow() that we have above, we pull in the whole table. Based on the Iceberg statistics, we can jump over data that's not needed at all. Instead, I would suggest constructing the predicate first, and then pulling in the data:
unique_keys = df.select(join_cols).group_by(join_cols).aggregate([])

if len(join_cols) == 1:
    pred = In(join_cols[0], unique_keys[0].to_pylist())
else:
    pred = Or(*[
        And(*[
            EqualTo(col, row[col])
            for col in join_cols
        ])
        for row in unique_keys.to_pylist()
    ])

ctx.register_dataset(target_table_name, ds.dataset(self.scan(row_filter=pred).to_arrow()))
This way we only load in Parquet files that are relevant to the set of keys that we're looking for.
values = [row[join_col] for row in update_recs.to_pylist()]
# if strings are in the filter, we encapsulate with tick marks
formatted_values = [f"'{value}'" if col_type == 'string' else str(value) for value in values]
overwrite_filter = f"{join_col} IN ({', '.join(formatted_values)})"
I see that we're constructing the overwrite filter as a string. The overwrite(..) method will take this string and parse it into a BooleanExpression. How about constructing a BooleanExpression right away:
overwrite_filter = f"{join_col} IN ({', '.join(formatted_values)})"
overwrite_filter = In(join_col, values)
Can you explain more on this? I thought I was building my overwrite filter expression correctly, given I have to provide a list of keys or composite key columns to scan for.
Happy to elaborate. The SQL-based syntax is supported mostly to make it easier for humans. If you follow the predicate in the code, first through txn.overwrite(update_recs, overwrite_filter), we then come to the .delete() method:
if isinstance(delete_filter, str):
    delete_filter = _parse_row_filter(delete_filter)
What we do here is take the string and turn it into a BooleanExpression. As an example, this converts the string into an In('col', [1,2,3]) expression. Since we don't need to use SQL here, we can optimize this by pushing down the In(..) directly, and avoid the conversion back and forth.
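To make the round-trip concrete, here is a minimal sketch (not from the PR) that assumes the string parser backing _parse_row_filter lives at pyiceberg.expressions.parser.parse; the column name and values are made up:

from pyiceberg.expressions import In
from pyiceberg.expressions.parser import parse

# String form: goes through the SQL-like parser before it can be used.
parsed = parse("col IN (1, 2, 3)")

# Direct form: the equivalent BooleanExpression, no parsing step needed.
direct = In("col", [1, 2, 3])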
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would this work for a situation where I have a composite primary key?
As an example, the table below has a primary key on cust_id and line_item:
cust_id  line_item  cost
1        1001       30.29
2        2001       20.99
My overwrite_filter, from what I understand, would have to be:
(cust_id = 1 and line_item = 1001) or (cust_id = 2 and line_item = 2001) ...and so forth, as more rows need to be compared.
Would your method above with _parse_row_filter work for composite keys, or only for single-key tables?
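For what it's worth, the composite-key predicate for the two example rows can also be built directly as a BooleanExpression, following the And/Or pattern from the earlier snippet. A sketch (not from the PR; the table and column names come from the example above):

from pyiceberg.expressions import And, Or, EqualTo

# One And(...) per source row, chained together with Or(...).
overwrite_filter = Or(
    And(EqualTo("cust_id", 1), EqualTo("line_item", 1001)),
    And(EqualTo("cust_id", 2), EqualTo("line_item", 2001)),
)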
Args:
    df: The input dataframe to merge with the table's data.
    join_cols: The columns to join on.
We can also do this later, but I think we should make join_cols optional. Instead, we want to consider rows equal that have the same identifier-field-ids, see: https://iceberg.apache.org/spec/?column-projection#identifier-field-ids
I think the join columns need to be provided in order to run a proper merge statement; from what I've seen, you cannot declare a primary key on an Iceberg table, thus the user would need to supply what rows should be matched on.
The primary-key equivalent in Iceberg is the identifier fields, so we could also get it from the table like this:
if join_cols is None:
    identifier_field_ids = self.schema().identifier_field_ids
    if len(identifier_field_ids) > 0:
        join_cols = [
            self.schema().find_column_name(identifier_field_id)
            for identifier_field_id in identifier_field_ids
        ]
    else:
        raise ValueError("The table doesn't have identifier fields, please set join_cols.")
We can also do this in a follow-up PR.
So, what if a user had constructed an Iceberg table and not specified the identifier fields? Additionally, the incoming dataset to merge is not in the Iceberg format, so it would not have identifier fields available yet to compare.
Does that make sense?
Hi all, I think if we tackle the basic merge/upsert pattern (when matched update all, when not matched insert all), that would cover 90% of merge use cases. For things that require a more involved upsert using multiple matched predicates, we should direct users to Spark SQL, since that is already baked into the platform. If anyone disagrees with that directional statement, please let me know.
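For context, the basic pattern discussed in this thread boils down to two existing pyiceberg operations inside one transaction: overwrite the matched rows with a key predicate, then append the unmatched rows. A rough sketch under that assumption (update_recs, insert_recs, join_col, and values are hypothetical inputs produced by the source-vs-target join):

from pyiceberg.expressions import In

# Both operations commit together, so the upsert stays atomic.
with table.transaction() as txn:
    # matched rows: replace the old versions of the keys being updated
    txn.overwrite(update_recs, overwrite_filter=In(join_col, values))
    # unmatched rows: brand-new keys are simply appended
    txn.append(insert_recs)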
@@ -1064,6 +1064,119 @@ def name_mapping(self) -> Optional[NameMapping]:
        """Return the table's field-id NameMapping."""
        return self.metadata.name_mapping()

    def merge_rows(self, df: pa.Table, join_cols: list
What do you think of calling this merge instead? I think this aligns better with append and overwrite.
def merge_rows(self, df: pa.Table, join_cols: list
def merge(self, df: pa.Table, join_cols: list
@Fokko I chose "merge_rows" because when I looked through the table __init__.py file, I saw a lot of areas that already had "merge" in them, e.g. "manifest_merge_enabled", "merge_append()", etc.
I figured it would be better to be more verbose in this case and say "merge_rows" vs. just "merge", as that might confuse others later when they look through the code.
Do you agree with this, or do you still think we should change the function name to "merge"?
Hi,
This is my first PR to the pyiceberg project, so I hope it is well received, and I'm open to any feedback. This feature has long been requested by the pyiceberg community: the ability to perform an "upsert" on an Iceberg table, a.k.a. the MERGE command. There is a lot to unpack in this PR, so I'll start with some high-level items and go into more detail.
For basic terminology, an upsert/merge is where you take a new dataset and merge it into the target Iceberg table by doing an update on the target table for rows that have changed and an insert for new rows. This is all performed atomically as one transaction, meaning both the update and the insert succeed together or they fail together; this ensures the table is not left in an unknown state if one of the actions succeeded but the other errored out.
Over the last decade, the ANSI SQL MERGE command has evolved a lot to handle more than just updating existing rows and inserting new rows. This PR aims to cover just the BASIC upsert pattern; it has been architected to allow more of the ANSI MERGE command features (such as "WHEN NOT MATCHED BY SOURCE THEN DELETE") to be added over time via the merge_options parameter, if others would like to contribute.
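For readers who want to see the shape of the API, here is a hedged usage sketch based on the merge_rows signature shown in the diff; the catalog, table, and column names are illustrative, and any parameters beyond df and join_cols are not shown:

import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
tbl = catalog.load_table("db.orders")

# New snapshot of source data: existing order_ids get updated, new ones get inserted.
source = pa.table({"order_id": [1, 2, 3], "status": ["shipped", "pending", "new"]})

tbl.merge_rows(source, join_cols=["order_id"])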
In order to efficiently process an upsert on the Iceberg table, an engine is required to detect which rows have changed and which rows are new. The engine I have chosen for this new feature is datafusion, a Rust-based, high-performance data processing engine that operates on Arrow data; I realize this introduces a new dependency for the pyiceberg project, which I've flagged as optional in the pyproject.toml file. My rationale for choosing datafusion is that it is also the library being used by the iceberg-rust project. Thus, down the road, migrating this merge command to the more modern iceberg-rust implementation should require minimal effort.
As far as test coverage is concerned, my unit tests cover performing upserts on both single-key and composite-key Iceberg tables (meaning the table has more than one key field it needs to join on). For performance testing, single-key tables scale just fine. My unit tests do a 10k-row update/insert for one of the test series, but on my local workstation (Mac M2 Pro) I have run a test of 500k rows of updates/inserts on a single-key join and it scales just fine.
Where I am hitting a wall on performance, and where I need others' help, is when you want to use a composite key. The composite-key code builds an overwrite filter, and once that filter gets too lengthy (in my testing, more than 200 rows), the visitor "OR" function in pyiceberg hits a recursion depth error. I took a hard look at the visitor code in the expressions folder and was hesitant to try to change any of those functions because I don't have a clear understanding of how they work, and this is where I need others' help. If you want to see this scaling concern, simply update the parameter on my test_merge_scenario_3_composite_key def to generate a target table with 1000 rows and a source table with 2000 rows, and you will see the error surface. I don't think it's smart or practical to try to change the Python default recursion depth limit, because we will still hit the wall at some point unless the visitor "OR" function gets reworked to avoid recursion.
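To illustrate the scaling concern, here is a sketch (not from the PR) of the kind of predicate the composite-key path produces; it assumes Or(...) nests one binary Or per additional argument, so the expression tree gets as deep as the number of source rows and the recursive visitor eventually exceeds Python's default recursion limit:

from pyiceberg.expressions import And, Or, EqualTo

# One And(...) per source row, all chained under Or(...).
rows = [{"cust_id": i, "line_item": 1000 + i} for i in range(5000)]
pred = Or(*[
    And(EqualTo("cust_id", r["cust_id"]), EqualTo("line_item", r["line_item"]))
    for r in rows
])
# Visiting `pred` (e.g. when binding it during overwrite/delete) recurses once per
# nested Or node, which is where the RecursionError surfaces for large key sets.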
Some other ideas I've kicked around to try to mitigate this: have the merge_rows code temporarily modify the source dataframe and the target Iceberg table to build a concatenated single key out of all the composite keys to join on, e.g. "col1_col2_col[n]...". But that would require modifying the entire Iceberg target table with a full overwrite, which defeats the purpose of an upsert: being able to run incremental updates on a table without having to overwrite the entire table.
I think this PR is a good first step toward finally realizing MERGE/upsert support for pyiceberg and puts a first chink in the armor. Again, I welcome others' feedback on this topic and look forward to partnering with you all on this journey.
Thanks,
Matt Martin