
Table corruption using lock-free Hive commits #11814

Open
1 of 3 tasks
sauliusvl opened this issue Dec 18, 2024 · 1 comment
Labels
bug Something isn't working

Apache Iceberg version

1.6.1

Query engine

Spark

Please describe the bug 🐞

We observed the following situation happen a few times now when using lock-free Hive catalog commits introduced in #6570:

We run an ALTER TABLE table SET TBLPROPERTIES ('key' = 'value') or any other operation that results in an Iceberg commit, from Spark or any other engine. For whatever reason the connection to the Hive metastore breaks and the HMS operation fails on the first attempt:

WARN org.apache.hadoop.hive.metastore.RetryingMetaStoreClient: MetaStoreClient lost connection. Attempting to reconnect (1 of 1) after 1s. alter_table_with_environmentContext
org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
<...>
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_alter_table_with_environment_context(ThriftHiveMetastore.java:1693)
<...>
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:169)
<...>
at org.apache.iceberg.hive.MetastoreUtil.alterTable(MetastoreUtil.java:78)
at org.apache.iceberg.hive.HiveOperationsBase.lambda$persistTable$0(HiveOperationsBase.java:112)
<...>
at org.apache.iceberg.hive.HiveTableOperations.doCommit(HiveTableOperations.java:239)
at org.apache.iceberg.BaseMetastoreTableOperations.commit(BaseMetastoreTableOperations.java:135)
<...>
at org.apache.iceberg.spark.SparkCatalog.alterTable(SparkCatalog.java:345)
<...>

but the operation actually succeeds on the server and updates the metadata location, so when the RetryingMetaStoreClient resubmits the operation, the retry fails with:

MetaException(message:The table has been modified. The parameter value for key 'metadata_location' is '<new>'. The expected was value was '<previous>')

The Iceberg commit is then considered failed and the new metadata file is cleaned up in the finally block here before the commit is retried. The problem is that the Hive table already points at the new metadata location, so when Iceberg tries to refresh the table it fails, because the new metadata file no longer exists, leaving the table in a corrupted state.
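To make the race easier to follow, here is a minimal, self-contained sketch of the sequence (all names are hypothetical, not the actual Iceberg or Hive classes): the first alter_table succeeds server-side, its acknowledgement is lost, and the client's retry then sees a changed metadata_location and reports failure even though its own update landed.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class LostAckRetrySketch {
    // Simulated HMS state: the table's current metadata_location.
    static String metadataLocation = "metadata/v1.json";
    // The first successful alter_table loses its acknowledgement.
    static final AtomicBoolean dropFirstAck = new AtomicBoolean(true);

    // Server-side compare-and-swap on metadata_location, as in the
    // lock-free commit path: succeeds only if the expected value matches.
    static void alterTable(String expected, String updated) {
        if (!metadataLocation.equals(expected)) {
            throw new IllegalStateException("The table has been modified");
        }
        metadataLocation = updated; // the update is applied...
        if (dropFirstAck.getAndSet(false)) {
            // ...but the acknowledgement never reaches the client.
            throw new RuntimeException("Connection reset");
        }
    }

    // Client-side retry wrapper, in the spirit of RetryingMetaStoreClient
    // with a single reconnect attempt.
    static boolean commitWithRetry(String expected, String updated) {
        for (int attempt = 1; attempt <= 2; attempt++) {
            try {
                alterTable(expected, updated);
                return true;
            } catch (IllegalStateException alreadyModified) {
                // Reported as a failed commit -> new metadata file deleted,
                // even though the location in HMS now points at it.
                return false;
            } catch (RuntimeException lostAck) {
                // transport error: fall through and retry
            }
        }
        return false;
    }

    public static void main(String[] args) {
        boolean reported = commitWithRetry("metadata/v1.json", "metadata/v2.json");
        System.out.println("client reports success: " + reported);
        System.out.println("HMS metadata_location:  " + metadataLocation);
    }
}
```

Running this shows the mismatch at the heart of the bug: the client reports failure while the metastore already points at the new metadata file.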

I suppose a fix could be to inspect the exception and ignore the failure when the location already set in HMS equals the new metadata location this commit tried to install, but parsing the error message for that sounds very hacky.
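A less hacky variant of the same idea might be to re-read the table from HMS after the failed alter and compare locations directly instead of parsing the message; if the current metadata_location already equals the one this commit tried to install, the retry merely raced its own first attempt and the commit can be treated as successful. A minimal sketch of that check (hypothetical names, not the actual Iceberg API):

```java
import java.util.Objects;

class CommitStatusCheck {
    enum CommitStatus { SUCCESS, FAILURE }

    // currentHmsLocation: metadata_location re-read from the metastore after
    // the failed alter; attemptedLocation: the new metadata file this commit
    // tried to install.
    static CommitStatus resolve(String currentHmsLocation, String attemptedLocation) {
        return Objects.equals(currentHmsLocation, attemptedLocation)
            ? CommitStatus.SUCCESS   // our retry raced our own first attempt
            : CommitStatus.FAILURE;  // a genuine concurrent modification
    }

    public static void main(String[] args) {
        // Our own commit landed despite the reported error:
        System.out.println(resolve("metadata/v2.json", "metadata/v2.json"));
        // Someone else really did modify the table:
        System.out.println(resolve("metadata/v3.json", "metadata/v2.json"));
    }
}
```

As far as I can tell, BaseMetastoreTableOperations already has a checkCommitStatus helper that re-checks the table state to decide whether a commit actually landed, so wiring this case into that path (rather than deleting the new metadata file in the finally block) might be the cleanest place for a fix.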

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
@sauliusvl sauliusvl added the bug Something isn't working label Dec 18, 2024
sauliusvl (Author) commented:

@pvary: maybe you have some insight here, i.e. what would be the best way to address it?
