
Table corruption using lock-free Hive commits #11814

Open
1 of 3 tasks
sauliusvl opened this issue Dec 18, 2024 · 1 comment
Labels
bug Something isn't working

Apache Iceberg version

1.6.1

Query engine

Spark

Please describe the bug 🐞

We observed the following situation happen a few times now when using lock-free Hive catalog commits introduced in #6570:

We run an ALTER TABLE table SET TBLPROPERTIES ('key' = 'value') or any other operation that results in an Iceberg commit, from Spark or any other engine. For whatever reason the connection to the Hive metastore breaks and the HMS operation fails on the first attempt:

WARN org.apache.hadoop.hive.metastore.RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect (1 of 1) after 1s. alter_table_with_environmentContext
org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
<...>
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_alter_table_with_environment_context(ThriftHiveMetastore.java:1693)
<...>
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:169)
<...>
at org.apache.iceberg.hive.MetastoreUtil.alterTable(MetastoreUtil.java:78)
at org.apache.iceberg.hive.HiveOperationsBase.lambda$persistTable$0(HiveOperationsBase.java:112)
<...>
at org.apache.iceberg.hive.HiveTableOperations.doCommit(HiveTableOperations.java:239)
at org.apache.iceberg.BaseMetastoreTableOperations.commit(BaseMetastoreTableOperations.java:135)
<...>
at org.apache.iceberg.spark.SparkCatalog.alterTable(SparkCatalog.java:345)
<...>

but the operation actually succeeds on the server and updates the metadata location, so when the RetryingMetaStoreClient resubmits the operation, the retry fails with:

MetaException(message:The table has been modified. The parameter value for key 'metadata_location' is '<new>'. The expected was value was '<previous>')

The Iceberg commit is then considered failed and the new metadata file is cleaned up in the finally block here before the commit is retried. The problem is that the Hive table already points at the new metadata location, so when Iceberg tries to refresh the table it fails, because the new metadata file no longer exists, leaving the table in a corrupted state.
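To make the race easier to follow, here is a minimal, self-contained sketch of the sequence (all names are hypothetical, not the actual Iceberg or Hive classes): the first alter_table succeeds server-side, its acknowledgement is lost, and the client's retry then sees a changed metadata_location and reports failure even though its own update landed.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class LostAckRetrySketch {
    // Simulated HMS state: the table's current metadata_location.
    static String metadataLocation = "metadata/v1.json";
    // The first successful alter_table loses its acknowledgement.
    static final AtomicBoolean dropFirstAck = new AtomicBoolean(true);

    // Server-side compare-and-swap on metadata_location, as in the
    // lock-free commit path: succeeds only if the expected value matches.
    static void alterTable(String expected, String updated) {
        if (!metadataLocation.equals(expected)) {
            throw new IllegalStateException("The table has been modified");
        }
        metadataLocation = updated; // the update is applied...
        if (dropFirstAck.getAndSet(false)) {
            // ...but the acknowledgement never reaches the client.
            throw new RuntimeException("Connection reset");
        }
    }

    // Client-side retry wrapper, in the spirit of RetryingMetaStoreClient
    // with a single reconnect attempt.
    static boolean commitWithRetry(String expected, String updated) {
        for (int attempt = 1; attempt <= 2; attempt++) {
            try {
                alterTable(expected, updated);
                return true;
            } catch (IllegalStateException alreadyModified) {
                // Reported as a failed commit -> new metadata file deleted,
                // even though the location in HMS now points at it.
                return false;
            } catch (RuntimeException lostAck) {
                // transport error: fall through and retry
            }
        }
        return false;
    }

    public static void main(String[] args) {
        boolean reported = commitWithRetry("metadata/v1.json", "metadata/v2.json");
        System.out.println("client reports success: " + reported);
        System.out.println("HMS metadata_location:  " + metadataLocation);
    }
}
```

Running this shows the mismatch at the heart of the bug: the client reports failure while the metastore already points at the new metadata file.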

I suppose a fix could be to inspect the exception and ignore the failure when the location already set in HMS equals the new metadata location this commit tried to install, but parsing the error message for that sounds very hacky.
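A less hacky variant of the same idea might be to re-read the table from HMS after the failed alter and compare locations directly instead of parsing the message; if the current metadata_location already equals the one this commit tried to install, the retry merely raced its own first attempt and the commit can be treated as successful. A minimal sketch of that check (hypothetical names, not the actual Iceberg API):

```java
import java.util.Objects;

class CommitStatusCheck {
    enum CommitStatus { SUCCESS, FAILURE }

    // currentHmsLocation: metadata_location re-read from the metastore after
    // the failed alter; attemptedLocation: the new metadata file this commit
    // tried to install.
    static CommitStatus resolve(String currentHmsLocation, String attemptedLocation) {
        return Objects.equals(currentHmsLocation, attemptedLocation)
            ? CommitStatus.SUCCESS   // our retry raced our own first attempt
            : CommitStatus.FAILURE;  // a genuine concurrent modification
    }

    public static void main(String[] args) {
        // Our own commit landed despite the reported error:
        System.out.println(resolve("metadata/v2.json", "metadata/v2.json"));
        // Someone else really did modify the table:
        System.out.println(resolve("metadata/v3.json", "metadata/v2.json"));
    }
}
```

As far as I can tell, BaseMetastoreTableOperations already has a checkCommitStatus helper that re-checks the table state to decide whether a commit actually landed, so wiring this case into that path (rather than deleting the new metadata file in the finally block) might be the cleanest place for a fix.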

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
@sauliusvl sauliusvl added the bug Something isn't working label Dec 18, 2024
sauliusvl (Author) commented:

@pvary: maybe you have some insight here, i.e. what would be the best way to address it?
