Add store maintenance scheduler #9041
Comments
How about using the label?
@bufferflies we can discuss implementation details in an RFC if the feature request is accepted. But I think we can piggyback on the existing annotations already introduced in https://github.com/pingcap/tidb-operator/blob/cb809c8954b07eab4138a9c4a3d993692caba577/docs/design-proposals/2021-11-24-graceful-restart-tikv-pod.md instead of adding a new one.
I think I have grasped your idea: you want to implement the process of scaling the node out or in on the PD side, not in another component.
@niubell PTAL
Hi @Tema, this seems like worthwhile advice. To clarify this request, I will describe it according to my understanding:
Is my understanding above correct? I also have one question: deleting a TiKV store and adding a new one could implement similar logic; can the delete API satisfy this requirement?
Feature Request
Describe your feature request related problem
TiKV availability is based on quorum. Because regions are balanced across all nodes in the cluster, if any two TiKV nodes restart at the same time (with the standard replication factor of 3), some data becomes unavailable for the duration of the restart.
In a modern cloud environment, node replacement is a routine operation. Some companies prefer to restart every node every few weeks to pick up the latest changes to the runtime environment and to ensure there is no slow resource leak. Modern operational practices also pursue continuous deployment, which can result in weekly deployments with rolling restarts.
All of the above requires very diligent orchestration of storage node restarts to avoid an availability drop. The orchestration needs to support various TiDB deployment topologies, including ones spanning multiple k8s clusters, so it cannot be solved at the tidb-operator layer, which is scoped to a single k8s cluster.
Describe the feature you'd like
This issue proposes to add an `evict-store` scheduler to PD which will help to strongly serialize any disruptive operations across `tikv` nodes, such as config changes requiring a restart, version upgrades, or underlying node rotation. The `evict-store` scheduler would be added for a specific store ID before the store prepares to go down and removed only once the store rejoins the cluster and catches up. Thanks to the strong serializability of `etcd` operations in PD, this can prevent availability loss for all proactive node restarts on a TiDB cluster, even one spanning geographical regions. A sketch of how an external orchestrator might drive this is shown below.
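For illustration, here is a minimal Go sketch of the maintenance flow an external orchestrator could follow. The scheduler name `evict-store-scheduler` is the proposed (not yet existing) one, and the PD address and store ID are assumptions; the request shape mirrors PD's existing `/pd/api/v1/schedulers` API used for schedulers such as `evict-leader-scheduler`.

```go
// Hypothetical orchestration sketch for the proposed evict-store scheduler.
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

const pdAddr = "http://127.0.0.1:2379" // assumed PD endpoint

// addMaintenance marks a store as under maintenance before it is taken down.
func addMaintenance(storeID uint64) error {
	body := []byte(fmt.Sprintf(`{"name": "evict-store-scheduler", "store_id": %d}`, storeID))
	resp, err := http.Post(pdAddr+"/pd/api/v1/schedulers", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("add scheduler: unexpected status %s", resp.Status)
	}
	return nil
}

// removeMaintenance clears the maintenance mark after the store has rejoined
// the cluster and caught up on its regions.
func removeMaintenance(storeID uint64) error {
	url := fmt.Sprintf("%s/pd/api/v1/schedulers/evict-store-scheduler-%d", pdAddr, storeID)
	req, err := http.NewRequest(http.MethodDelete, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("remove scheduler: unexpected status %s", resp.Status)
	}
	return nil
}

func main() {
	// Example: serialize maintenance of store 1.
	if err := addMaintenance(1); err != nil {
		panic(err)
	}
	// ... restart the TiKV node, wait for it to rejoin and catch up ...
	if err := removeMaintenance(1); err != nil {
		panic(err)
	}
}
```

Because adding the scheduler is a single strongly consistent write in PD, only one store can hold the "maintenance" slot at a time regardless of which k8s cluster or tool initiated it.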
Describe alternatives you've considered
In a k8s environment these types of problems are usually tackled with a Pod Disruption Budget. While that works well for stateless workloads, it does not integrate well with stateful systems like TiDB, where you have to wait for catch-up after pods come back. Moreover, the concept is scoped to a single k8s cluster and provides no protection for deployments spanning multiple regions or multiple k8s clusters.
There were previous attempts to solve a similar problem using pod annotations enforced at the `tidb-operator` level by this RFC. That approach does not prevent all races when annotations are applied to multiple nodes at approximately the same time. Theoretically the race could be prevented with an in-memory lock in tidb-operator, but that would not cover tidb-operator restarts and has no way to serialize stores across TiDB clusters deployed in multiple k8s clusters.
Teachability, Documentation, Adoption, Migration Strategy
The proposed scheduler can be implemented independently of its usage by tiup and tidb-operator. Once implemented, tidb-operator users could opt in to it using an annotation on the tikv section of the TiDB cluster CR, like the sketch below.
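A minimal sketch of what such an opt-in might look like, assuming a hypothetical annotation key; the actual key and value would need to be defined in the RFC:

```yaml
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: basic
spec:
  tikv:
    replicas: 3
    # Hypothetical annotation; the real key/value would be decided in the RFC
    # for the proposed store maintenance (evict-store) scheduler.
    annotations:
      tidb.pingcap.com/use-store-maintenance-scheduler: "true"
```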