Conversation

@aokolnychyi (Contributor)

What changes were proposed in this pull request?

This PR prevents failures during recaching from failing write/refresh operations.

Why are the changes needed?

After the recent changes in SPARK-54387, we may mark a write operation as failed even though it committed to the table successfully and only the cache refresh was unsuccessful.

Does this PR introduce any user-facing change?

Yes, recacheByXXX will no longer throw an exception if recaching fails.

How was this patch tested?

This PR comes with tests.

Was this patch authored or co-authored using generative AI tooling?

No.

try {
  val sessionWithConfigsOff = getOrCloneSessionWithConfigsOff(spark)
  val (newKey, newCache) = sessionWithConfigsOff.withActive {
    val refreshedPlan = V2TableRefreshUtil.refresh(cd.plan)
@aokolnychyi (Contributor Author) commented on Nov 20, 2025

In 4.1, we added this line to refresh versions. This refresh MUST NOT fail writes.

val refreshedPlan = V2TableRefreshUtil.refresh(cd.plan)

I am not sure how we want to treat failures from the line below. Previously, this threw an exception and potentially marked writes as failed if we couldn't refresh.

val qe = sessionWithConfigsOff.sessionState.executePlan(refreshedPlan)

The current implementation will NOT throw an exception in either case. Other options:

  • Don't fail only if the refresh fails, but continue to fail if QE construction fails (see the sketch below).
  • Don't fail in either case, but only for DSv2 operations.
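A rough sketch of the first option, reusing the identifiers and imports from the hunk above (cd, sessionWithConfigsOff, logWarning, NonFatal); this is illustrative only, not the final implementation:

  // Best-effort version refresh: a failure here must not fail the write.
  val refreshedPlan =
    try {
      V2TableRefreshUtil.refresh(cd.plan)
    } catch {
      case NonFatal(e) =>
        // Keep the existing cached plan if the version refresh fails.
        logWarning(log"Failed to refresh table versions for cached plan", e)
        cd.plan
    }

  // Exceptions from here on still propagate and mark the write as failed, as before.
  val qe = sessionWithConfigsOff.sessionState.executePlan(refreshedPlan)

(Skipping the recache for that entry entirely would be another way to realize the same option.)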

@aokolnychyi (Contributor Author)

It does seem like all our current invocations treat cache refresh as opportunistic. In other words, it is usually the last step, where it is critical to remove the old entry but recaching may or may not succeed. Is that the same understanding that everyone has? Any cases I missed?
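A rough sketch of that understanding (not actual Spark code; the exact CacheManager signatures and logging calls may differ):

  // Cache refresh after a write: dropping the stale entry is critical,
  // rebuilding it is best-effort.
  def refreshCacheAfterWrite(spark: SparkSession, plan: LogicalPlan): Unit = {
    val cacheManager = spark.sharedState.cacheManager

    // Critical: the stale entry must go away so readers never see old data.
    cacheManager.uncacheQuery(spark, plan, cascade = true)

    // Opportunistic: failures while rebuilding the entry are only logged.
    try {
      cacheManager.recacheByPlan(spark, plan)
    } catch {
      case NonFatal(e) => logWarning(log"Failed to recache query", e)
    }
  }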

@aokolnychyi (Contributor Author)

@dongjoon-hyun, following up on the question here. I see only one place that potentially calls recache on the read path, and it is related to AQE. Usually, only write or REFRESH operations trigger recaching.

  private def buildBuffers(): RDD[CachedBatch] = {
    val cb = try {
      if (supportsColumnarInput) {
        serializer.convertColumnarBatchToCachedBatch(
          cachedPlan.executeColumnar(),
          cachedPlan.output,
          storageLevel,
          cachedPlan.conf)
      } else {
        serializer.convertInternalRowToCachedBatch(
          cachedPlan.execute(),
          cachedPlan.output,
          storageLevel,
          cachedPlan.conf)
      }
    } catch {
      case e: Throwable if cachedPlan.isInstanceOf[AdaptiveSparkPlanExec] =>
        // SPARK-49982: during RDD execution, AQE will execute all stages except ResultStage. If any
        // failure happens, the failure will be cached and the next SQL cache caller will hit the
        // negative cache. Therefore we need to recache the plan.
        val session = cachedPlan.session
        session.sharedState.cacheManager.recacheByPlan(session, logicalPlan)
        throw e
    }

@dongjoon-hyun (Member) left a comment

For write operations, I got it. But for read operations, is it valid to swallow the error, @aokolnychyi?

  Some(cd.copy(plan = newKey, cachedRepresentation = newCache))
} catch {
  case NonFatal(e) =>
    logWarning(log"Failed to recache query", e)
Member

I'm worried about the side effects of this part.

@aokolnychyi (Contributor Author)

I worry too; that's why I want everyone to take a look.

@cloud-fan (Contributor)

This is a tricky case and I'd like to understand more context:

  • What's the behavior of DML cache refreshing for v1 tables today? Do we allow schema changes and always rebuild the query plan with the latest table version?
  • I think it makes sense to fail if a read query detects an incompatible table change after the plan is analyzed, but for DML cache refreshing it doesn't matter, and it seems OK to always use the latest table version.

try {
  val sessionWithConfigsOff = getOrCloneSessionWithConfigsOff(spark)
  val (newKey, newCache) = sessionWithConfigsOff.withActive {
    val refreshedPlan = V2TableRefreshUtil.refresh(cd.plan)
Member

If we want to avoid a failure here, how about skipping query plan validation in the method call instead?

val refreshedPlan = V2TableRefreshUtil.refresh(cd.plan, validation = false)

and skip the following in V2TableRefreshUtil.refresh

        validateTableIdentity(currentTable, r)
        validateDataColumns(currentTable, r)
        validateMetadataColumns(currentTable, r)
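
A rough sketch of how that flag could gate validation inside V2TableRefreshUtil.refresh; everything other than the three validate calls is a placeholder, not the actual implementation:

  def refresh(plan: LogicalPlan, validation: Boolean = true): LogicalPlan =
    plan transform {
      case r: DataSourceV2Relation =>
        // Placeholder for the existing lookup of the table's current version.
        val currentTable = reloadTable(r)
        if (validation) {
          validateTableIdentity(currentTable, r)
          validateDataColumns(currentTable, r)
          validateMetadataColumns(currentTable, r)
        }
        r.copy(table = currentTable)
    }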

@holdenk (Contributor) commented on Nov 21, 2025

Is this a release blocker / regression?

@dongjoon-hyun (Member)

Gentle ping, @aokolnychyi.

