Add store maintenance scheduler #9041
Comments
How about using the label?
@bufferflies we can discuss implementation details in an RFC if the feature request is accepted. But I think we can piggyback on the existing annotations already introduced in https://github.com/pingcap/tidb-operator/blob/cb809c8954b07eab4138a9c4a3d993692caba577/docs/design-proposals/2021-11-24-graceful-restart-tikv-pod.md instead of adding a new one.
I think I have grasped your idea: you want to implement the process of scaling the node out or in on the PD side, not in another component.
@niubell PTAL
Hi @Tema, this seems like worthwhile advice. To clarify this request, I will describe it according to my understanding:
Is my understanding above correct? I also have one question: deleting a TiKV store and adding a new one could implement similar logic; can the delete API satisfy this requirement?
Feature Request
Describe your feature request related problem
TiKV availability is based on quorum. Because regions are balanced across all nodes in the cluster, if any two TiKV nodes restart at the same time (with the standard replication factor of 3), some data becomes unavailable for the duration of the restart.
In a modern cloud environment, node replacement is a routine operation. Some companies prefer to restart every node every few weeks to pick up the latest changes to the runtime environment and to ensure there is no slow resource leak. Modern operational practices also pursue continuous deployment, which can result in weekly deployments with rolling restarts.
All of the above requires very diligent orchestration of storage node restarts to avoid an availability drop. The orchestration needs to support various TiDB deployment topologies, including ones spanning multiple k8s clusters, so it cannot be solved at the tidb-operator layer, which is scoped to a single k8s cluster.
Describe the feature you'd like
This issue proposes to add an `evict-store` scheduler to PD which will help to strongly serialize any disruptive operations across `tikv` nodes, such as config changes requiring a restart, version upgrades, or underlying node rotation. The `evict-store` scheduler would be added for a specific store ID before the store prepares to go down and removed only once the store rejoins the cluster and catches up. Thanks to the strong serializability of `etcd` operations in PD, this can prevent availability loss for all proactive node restarts on a TiDB cluster, even one spanning geographical regions. A sketch of how an external orchestrator might drive this is shown below.
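For illustration, here is a minimal Go sketch of the maintenance flow an external orchestrator could follow. The scheduler name `evict-store-scheduler` is the proposed (not yet existing) one, and the PD address and store ID are assumptions; the request shape mirrors PD's existing `/pd/api/v1/schedulers` API used for schedulers such as `evict-leader-scheduler`.

```go
// Hypothetical orchestration sketch for the proposed evict-store scheduler.
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

const pdAddr = "http://127.0.0.1:2379" // assumed PD endpoint

// addMaintenance marks a store as under maintenance before it is taken down.
func addMaintenance(storeID uint64) error {
	body := []byte(fmt.Sprintf(`{"name": "evict-store-scheduler", "store_id": %d}`, storeID))
	resp, err := http.Post(pdAddr+"/pd/api/v1/schedulers", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("add scheduler: unexpected status %s", resp.Status)
	}
	return nil
}

// removeMaintenance clears the maintenance mark after the store has rejoined
// the cluster and caught up on its regions.
func removeMaintenance(storeID uint64) error {
	url := fmt.Sprintf("%s/pd/api/v1/schedulers/evict-store-scheduler-%d", pdAddr, storeID)
	req, err := http.NewRequest(http.MethodDelete, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("remove scheduler: unexpected status %s", resp.Status)
	}
	return nil
}

func main() {
	// Example: serialize maintenance of store 1.
	if err := addMaintenance(1); err != nil {
		panic(err)
	}
	// ... restart the TiKV node, wait for it to rejoin and catch up ...
	if err := removeMaintenance(1); err != nil {
		panic(err)
	}
}
```

Because adding the scheduler is a single strongly consistent write in PD, only one store can hold the "maintenance" slot at a time regardless of which k8s cluster or tool initiated it.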
Describe alternatives you've considered
In a k8s environment these types of problems are usually tackled with a Pod Disruption Budget. While that works well for stateless workloads, it does not integrate well with stateful systems like TiDB, where you have to wait for catch-up after pods come back. Moreover, the concept is scoped to a single k8s cluster and provides no protection for deployments spanning multiple regions or multiple k8s clusters.
There were previous attempts to solve a similar problem using pod annotations enforced at the `tidb-operator` level by this RFC. That approach does not prevent all races when annotations are applied to multiple nodes at approximately the same time. Theoretically the race could be prevented with an in-memory lock in tidb-operator, but that would not cover tidb-operator restarts and has no way to serialize stores across TiDB clusters deployed in multiple k8s clusters.
Teachability, Documentation, Adoption, Migration Strategy
The proposed scheduler can be implemented independently of its usage by tiup and tidb-operator. Once implemented, tidb-operator users could opt in to it using an annotation on the tikv section of the TiDB cluster CR, like the sketch below.
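A minimal sketch of what such an opt-in might look like, assuming a hypothetical annotation key; the actual key and value would need to be defined in the RFC:

```yaml
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: basic
spec:
  tikv:
    replicas: 3
    # Hypothetical annotation; the real key/value would be decided in the RFC
    # for the proposed store maintenance (evict-store) scheduler.
    annotations:
      tidb.pingcap.com/use-store-maintenance-scheduler: "true"
```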