
Add store maintenance scheduler #9041

Open
Tema opened this issue Feb 7, 2025 · 5 comments
Labels
type/feature-request Categorizes issue or PR as related to a new feature.

Comments

@Tema
Contributor

Tema commented Feb 7, 2025

Feature Request

Describe your feature request related problem

TiKV availability is based on quorum. Because regions are balanced evenly across all nodes in the cluster, if any two tikv nodes restart at the same time (with the standard replication factor of 3), some regions lose quorum and their data becomes unavailable for the duration of the restart.

In a modern cloud environment, node replacement is a routine operation. Some companies restart every node every few weeks to pick up the latest changes to the runtime environment and to ensure there is no slow resource leak. Modern operational practices also pursue continuous deployment, which can result in weekly releases with rolling restarts.

All of the above requires diligent orchestration of storage node restarts to avoid an availability drop. The orchestration needs to support various TiDB deployment topologies, including ones spanning multiple k8s clusters, so it cannot be solved at the tidb-operator layer, which is scoped to a single k8s cluster.

Describe the feature you'd like

This issue proposes adding an evict-store scheduler to PD, which will help strongly serialize any disruptive operations across tikv nodes, such as a config change requiring a restart, a version upgrade, or rotation of the underlying node.

The evict-store scheduler would be added for a specific store id before that store prepares to go down, and removed only after the store rejoins the cluster and catches up. Thanks to the strong serializability of etcd operations in PD, this can prevent availability loss for all proactive node restarts in a TiDB cluster, even one spanning geographical regions.
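
For illustration only, here is a minimal sketch of how an external orchestrator could drive such a maintenance window, assuming the proposed scheduler would be added and removed through PD's existing scheduler HTTP endpoints the same way evict-leader-scheduler is today; the `evict-store-scheduler` name, endpoints, and helper names are illustrative, not a final API:

```go
// Illustrative sketch: assumes the proposed evict-store scheduler is exposed
// through PD's existing scheduler HTTP endpoints, mirroring how
// evict-leader-scheduler is managed today. Not a final API.
package maintenance

import (
	"bytes"
	"fmt"
	"net/http"
)

const pdAddr = "http://127.0.0.1:2379" // assumed PD client URL

// EnterMaintenance registers the (proposed) evict-store scheduler for a store
// before any disruptive operation such as a restart or node rotation.
func EnterMaintenance(storeID uint64) error {
	body := fmt.Sprintf(`{"name":"evict-store-scheduler","store_id":%d}`, storeID)
	resp, err := http.Post(pdAddr+"/pd/api/v1/schedulers", "application/json", bytes.NewBufferString(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("add evict-store-scheduler: unexpected status %s", resp.Status)
	}
	return nil
}

// LeaveMaintenance removes the scheduler only after the store has rejoined the
// cluster and caught up, re-enabling normal scheduling for it.
func LeaveMaintenance(storeID uint64) error {
	url := fmt.Sprintf("%s/pd/api/v1/schedulers/evict-store-scheduler-%d", pdAddr, storeID)
	req, err := http.NewRequest(http.MethodDelete, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("remove evict-store-scheduler: unexpected status %s", resp.Status)
	}
	return nil
}
```

An orchestrator would call EnterMaintenance, wait for leaders (and followers, if follower read is used) to drain, perform the restart, wait for the store to catch up, and only then call LeaveMaintenance.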

Describe alternatives you've considered

In a k8s environment this type of problem is usually tackled with a Pod Disruption Budget. While that works well for stateless workloads, it does not integrate well with stateful systems like TiDB, where you have to wait for catch-up after pods come back. Moreover, the concept is scoped to a single k8s cluster and offers no protection for deployments across multiple regions or across multiple k8s clusters.

There were previous attempts to solve a similar problem using pod annotations enforced at the tidb-operator level by this RFC. That approach does not prevent all races when annotations are applied to multiple nodes at approximately the same time. In theory the races could be prevented with an in-memory lock in tidb-operator, but that would not survive tidb-operator restarts and has no way to serialize stores across TiDB clusters deployed in multiple k8s clusters.

Teachability, Documentation, Adoption, Migration Strategy

The proposed scheduler can be implemented independently of its use by tiup and tidb-operator. Once implemented, tidb-operator users can opt in via an annotation on the tikv section of the TiDB cluster CR, as in the sketch below.
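
As a purely hypothetical sketch of that opt-in (the annotation key, cluster name, and namespace below are invented; the real key and semantics would be defined in an RFC), an orchestrator or user could patch the tikv section of the TidbCluster CR like this:

```go
// Hypothetical opt-in sketch: the annotation key below is invented for
// illustration; the real key and semantics would be defined by the feature.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// TidbCluster CRD served by tidb-operator.
	gvr := schema.GroupVersionResource{Group: "pingcap.com", Version: "v1alpha1", Resource: "tidbclusters"}
	// Merge-patch the tikv section with a (hypothetical) opt-in annotation.
	patch := []byte(`{"spec":{"tikv":{"annotations":{"tidb.pingcap.com/store-maintenance-scheduler":"enabled"}}}}`)
	_, err = client.Resource(gvr).Namespace("tidb-cluster").Patch(
		context.TODO(), "basic", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("opt-in annotation applied to TidbCluster basic")
}
```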

@Tema added the type/feature-request label on Feb 7, 2025
@bufferflies
Contributor

How about using the offline label to implement this behavior?

@Tema
Contributor Author

Tema commented Feb 7, 2025

How about using the offline label to implement this behavior?

@bufferflies we can discuss implementation details in an RFC if the feature request is accepted. But I think we can piggyback on the existing annotations already introduced in https://github.com/pingcap/tidb-operator/blob/cb809c8954b07eab4138a9c4a3d993692caba577/docs/design-proposals/2021-11-24-graceful-restart-tikv-pod.md instead of adding a new one.

@bufferflies
Contributor

I think I have grasped your idea: you want to implement the scale-out/scale-in process for a node on the PD side, rather than in another component.

@Tema
Contributor Author

Tema commented Feb 10, 2025

@niubell PTAL

@niubell
Contributor

niubell commented Feb 11, 2025

Hi @Tema, this seems like a valuable suggestion. To clarify the request, I will describe it according to my understanding:

  • Due to configuration adjustments, replacement of damaged disks, version upgrades, or other routine node replacement, we want to restart a tikv store smoothly.
  • Add a new kind of scheduler, 'evict-store-scheduler', to evict the store before restarting the corresponding tikv store; the new scheduler includes functions like the ones below (a rough sketch follows this list):
    • Evict region leaders.
    • Evict region followers (if follower read/stale read is enabled).
    • The store will not be a target for any region scheduling.
  • Remove the evict-store-scheduler after the maintenance operation is finished, so the same tikv store can rejoin the cluster.
  • TiUP/tidb-operator/the TiDB cluster CR can also integrate this new scheduler to provide sequential restarts or other usage patterns.
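
For illustration, a rough conceptual sketch of those three behaviors, assuming invented types and helper names rather than PD's actual scheduler interface:

```go
// Conceptual sketch of the behaviors listed above. This is not PD's real
// scheduler interface; the types and names are invented for illustration.
package maintenance

// StoreState is the minimal view of a store this sketch needs.
type StoreState struct {
	ID               uint64
	UnderMaintenance bool // true while an evict-store-scheduler exists for it
}

// AllowAsSchedulingTarget reports whether new peers or leaders may be placed
// on the store; a store under maintenance is never a scheduling target.
func AllowAsSchedulingTarget(s StoreState) bool {
	return !s.UnderMaintenance
}

// ShouldEvictLeaders reports whether region leaders must be transferred away.
func ShouldEvictLeaders(s StoreState) bool {
	return s.UnderMaintenance
}

// ShouldEvictFollowers reports whether follower replicas should also be moved,
// which matters when follower read / stale read routes reads to followers.
func ShouldEvictFollowers(s StoreState, followerReadEnabled bool) bool {
	return s.UnderMaintenance && followerReadEnabled
}
```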

Is my understanding above correct?

I also have one question: deleting a tikv store and adding a new one can implement similar logic; can the delete API satisfy this requirement?
