Skip to content

Conversation

@qw4990
Copy link
Contributor

@qw4990 qw4990 commented Jan 9, 2026

What problem does this PR solve?

Issue Number: close #65378

Problem Summary: planner: support cluster-level binding reload command

What changed and how does it work?

Currently, the ADMIN RELOAD BINDINGS command only reloads SQL bindings on the TiDB node where the command is executed. In a cluster environment with multiple TiDB nodes, users need to manually execute the command on each node to reload bindings across the entire cluster. This is inconvenient and error-prone, especially when bindings are modified and need to be synchronized across all nodes.

This PR introduces a new ADMIN RELOAD CLUSTER BINDINGS command that reloads SQL bindings across all TiDB nodes in the cluster with a single command execution.

Key changes:

  1. Parser support: Added AdminReloadClusterBindings AST node type and parsing support for the new ADMIN RELOAD CLUSTER BINDINGS command in pkg/parser/ast/ and pkg/parser/.

  2. Plan builder: Added OpReloadClusterBindings operation type in pkg/planner/core/common_plans.go and handled the new command in pkg/planner/core/planbuilder.go.

  3. Executor implementation: Implemented reloadClusterBindings() method in pkg/executor/bind.go that uses the existing broadcast() function to send the reload command to all TiDB nodes in the cluster, including the current node.

  4. Logging enhancement: Added logging to bindingCacheUpdater.LoadFromStorageToCache() in pkg/bindinfo/binding_cache.go to track the number of bindings loaded, cache size, and duration during full load operations.

How it works:

When a user executes ADMIN RELOAD CLUSTER BINDINGS, the command is broadcast to all TiDB nodes in the cluster using the coprocessor framework's broadcast mechanism (similar to how REFRESH STATS CLUSTER works). Each TiDB node receives the broadcast and executes ADMIN RELOAD BINDINGS locally, which triggers LoadFromStorageToCache(true) to reload all bindings from storage into the local cache.

The broadcast mechanism ensures that:

  • The command reaches all TiDB nodes in the cluster
  • Each node reloads its local binding cache from storage
  • The operation is atomic at the cluster level (all nodes reload simultaneously)

Test:

Since the test of this PR requires an entire cluster environment and some interaction between different TiDB nodes, so it's hard to write a UT for it. Then I just tested it manually, please see the picture below:

  1. I create a table and a binding.
  2. In the first "admin reload bindings", only the node executing this command loads all bindings;
  3. Then in the second "admin reload cluster bindings", all nodes load all bindings;
  4. Then I create a new binding, and run "admin reload cluster bindings" again, all nodes load 2 new bindings.
image

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. sig/planner SIG: Planner size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 9, 2026
@tiprow
Copy link

tiprow bot commented Jan 9, 2026

Hi @qw4990. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@codecov
Copy link

codecov bot commented Jan 9, 2026

Codecov Report

❌ Patch coverage is 53.12500% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.5705%. Comparing base (28cce6a) to head (59ceb6b).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #65509        +/-   ##
================================================
+ Coverage   77.8398%   79.5705%   +1.7307%     
================================================
  Files          1971       1895        -76     
  Lines        540836     528806     -12030     
================================================
- Hits         420986     420774       -212     
+ Misses       118190     106599     -11591     
+ Partials       1660       1433       -227     
Flag Coverage Δ
integration 47.5533% <40.6779%> (-0.6462%) ⬇️
unit 76.7587% <53.1250%> (+0.3018%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 56.7974% <ø> (ø)
parser ∅ <ø> (∅)
br 65.7217% <ø> (+4.6046%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@qw4990
Copy link
Contributor Author

qw4990 commented Jan 9, 2026

/test unit-test

@tiprow
Copy link

tiprow bot commented Jan 9, 2026

@qw4990: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

Details

In response to this:

/test unit-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@qw4990
Copy link
Contributor Author

qw4990 commented Jan 12, 2026

/test check-dev2

@tiprow
Copy link

tiprow bot commented Jan 12, 2026

@qw4990: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

Details

In response to this:

/test check-dev2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Contributor

@AilinKid AilinKid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rest LGTM

}

sum = p.BasePhysicalPlan.MemoryUsage() + p.Inner.MemoryUsage()
sum = p.BasePhysicalPlan.MemoryUsage()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

base.Plan don't have the memory usage interface, should we care?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that base.Plan doesn't have a MemoryUsage function. And if we want to add MemoryUsage to base.Plan, then we have to implement this function for all LogicalPlan, which is very complex... So to make the code simple, I guess the best way is to just ignore it here...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a comment here to notify that this may cause the memory usage to be inaccurate. And explain why it is acceptable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, I believe we can still try to reuse the simple here. We just need to add SimpleExec.executeAdmin to accomplish this. Then, we won't need to hard-code the handling of this specific admin command.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Offline discussed, thanks!

bindingLogger().Info("load bindings", zap.Bool("fullLoad", fullLoad),
zap.Bool("cacheSizeChange", cacheSizeChange), zap.Bool("hasNewBinding", hasNewBinding),
zap.Int64("cacheCapacity", u.GetMemCapacity()), zap.Int64("cacheUsage", u.GetMemUsage()),
zap.Int64("cachedBindingNum", int64(u.Size())), zap.Duration("duration", time.Since(begin)), zap.Error(err))
Copy link
Contributor

@AilinKid AilinKid Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we don't unit test literally, we should have one more test for
two instance, one has mem-quota= small, the other one has memo-quota big
then trigger the admin reload clusters bindings, if one can not load them all, better show some log or warnings, cause it will cause some in-consistence behavior even after this strong synchronization command.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently tidb_mem_quota_binding_cache is a global variable, all instances should have the same memory quota. But your comment just reminds me that we need to log if some values are evicted from the cache. I added some new log for this:
image

Comment on lines 62 to 65
if u.GetMemCapacity() != vardef.MemQuotaBindingCache.Load() {
cacheSizeChange = true
u.SetMemCapacity(vardef.MemQuotaBindingCache.Load())
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Between the comparison and SetMemCapacity(), another goroutine could modify the global variable. Should we cache the loaded value in a local variable first?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I've updated it.

type PhysicalSimpleWrapper struct {
physicalop.BasePhysicalPlan
Inner Simple
Inner base.Plan
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of PhysicalSimpleWrapper is to wrap Simple (which is actually a LogicalPlan) to a PhysicalPlan. Now I want to also use it to wrap Admin to a PhysicalPlan, so I just modify its type from Simple to base.Plan.
And I renamed this structure from PhysicalSimpleWarpper to PhysicalPlanWrapper, which means it can wrap any kind of plan to a PhysicalPlan.

Copy link
Member

@0xPoe 0xPoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I am a bit confused. How do you distinguish between the admin SQL from remote and the user? I think you should rely on IsFromRemote to make that distinction. Also, please include a manual test in your PR to ensure it works well with multiple instances.

return
}

sum = s.SimpleSchemaProducer.MemoryUsage() + size.SizeOfInterface + size.SizeOfBool + size.SizeOfUint64
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the prior implementation is not complete, and it's not used any longer, so we can just remove it.

@qw4990
Copy link
Contributor Author

qw4990 commented Jan 12, 2026

Thanks!

I am a bit confused. How do you distinguish between the admin SQL from remote and the user? I think you should rely on IsFromRemote to make that distinction. Also, please include a manual test in your PR to ensure it works well with multiple instances.

offline discussed

Copy link
Member

@0xPoe 0xPoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

return cmpResult > 0
})
for _, hint := range bindings {
if hint == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to add some comments explaining when this will happen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated!

Comment on lines 310 to 312
if innerPlan == nil {
errors.Errorf("unexpected statement %s in broadcast query", *e.BroadcastQuery.Query)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this one can be in the default branch. And it seems the error is not used here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Jan 13, 2026
Copy link
Contributor

@fixdb fixdb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@ti-chi-bot
Copy link

ti-chi-bot bot commented Jan 13, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: AilinKid, fixdb
Once this PR has been reviewed and has the lgtm label, please assign d3hunter, yudongusa for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jan 13, 2026
@ti-chi-bot
Copy link

ti-chi-bot bot commented Jan 13, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-01-13 03:52:29.769886591 +0000 UTC m=+329593.831751504: ☑️ agreed by AilinKid.
  • 2026-01-13 04:50:59.455647466 +0000 UTC m=+333103.517512375: ☑️ agreed by fixdb.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lgtm release-note-none Denotes a PR that doesn't merit a release note. sig/planner SIG: Planner size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

planner: support cluster-level "admin reload bindings"

4 participants