
Change etcd liveness probes to the new livez and readyz endpoints #3039

Closed
siyuanfoundation opened this issue Apr 1, 2024 · 11 comments · Fixed by kubernetes/kubernetes#124465
Assignee: neolit123
Labels: area/etcd, kind/feature, priority/important-longterm
Milestone: v1.31

Comments

@siyuanfoundation

This is related to etcd-io/etcd#16007

Since v3.5.11, etcd has added new /livez and /readyz HTTP endpoints.
The design for the new probes is documented in etcd livez and readyz probes.

The new probes are Kubernetes API compliant, have more fine-grained checks, and are better suited for liveness and readiness checks than /health?exclude=NOSPACE&serializable=true and /health?serializable=false.

| Probe | Endpoint used by kubeadm | Endpoint proposed to use | Difference |
| --- | --- | --- | --- |
| LivenessProbe | `/health?exclude=NOSPACE&serializable=true` | `/livez` | `/livez` does not check for a leader, because restarting the local server does not mitigate the problem of the cluster missing a leader, while a crash loop would worsen the problem. |
| StartupProbe | `/health?serializable=false` | `/readyz` | `/readyz` ignores the NOSPACE alarm, because if the local server is out of quota, it is still able to take read/delete requests, and still able to forward write requests to the leader to be written successfully in the cluster. |
| ReadinessProbe | None | `/readyz` | |

We should switch the current etcd LivenessProbe and StartupProbe to the new endpoints which are Kubernetes API compliant.
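For illustration, the proposed mapping can be sketched as a small lookup (a hedged sketch with hypothetical helper names — kubeadm itself generates these probes in Go; the endpoint strings are taken from the table above):

```python
# Illustrative sketch only, not kubeadm's actual implementation.
# Endpoint strings come from the comparison table above.

OLD_PROBES = {
    "livenessProbe": "/health?exclude=NOSPACE&serializable=true",
    "startupProbe": "/health?serializable=false",
    # kubeadm currently configures no readinessProbe for etcd
}

NEW_PROBES = {
    "livenessProbe": "/livez",
    "startupProbe": "/readyz",
    "readinessProbe": "/readyz",
}

def probe_path(probe, etcd_version):
    """Pick the probe endpoint, preferring the new livez/readyz
    endpoints when the etcd version supports them (>= 3.5.11)."""
    if etcd_version >= (3, 5, 11):
        return NEW_PROBES.get(probe)
    return OLD_PROBES.get(probe)
```

For example, `probe_path("livenessProbe", (3, 5, 12))` would select `/livez`, while the same probe against etcd 3.5.10 would keep the old `/health` query.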

Is this a BUG REPORT or FEATURE REQUEST?

FEATURE REQUEST

@siyuanfoundation
Author

siyuanfoundation commented Apr 1, 2024

cc @serathius @ahrtr

@neolit123
Member

> Since v3.5.11, etcd has added new livez/readyz HTTP endpoints

Q: is there a reason etcd released this feature in a patch release? perhaps it was considered a bugfix.

kubeadm 1.30 (to be released) is using etcd 3.5.12
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/constants/constants.go#L481-L491

so for kubeadm 1.31 we can switch to the new probe method.
it might make more sense to switch in the kubeadm MINOR version that starts using 3.6 so that our users will know a bigger change happened, but perhaps this is not needed.

generally more opinions on this topic are appreciated.
cc @SataQiu @pacoxu

@neolit123 neolit123 added the priority/important-longterm, kind/feature, and area/etcd labels Apr 2, 2024
@neolit123 neolit123 added this to the v1.31 milestone Apr 2, 2024
@neolit123
Member

neolit123 commented Apr 2, 2024

> /readyz

do we expect any behavior changes if we add a readiness probe for etcd in kubeadm?
currently it indeed doesn't have a readiness probe.

@pacoxu
Member

pacoxu commented Apr 2, 2024

Since we backport etcd minor version to previous releases as well, it's quite safe to change the probe in v1.31.

https://github.com/kubernetes/kubernetes/blob/release-1.27/cmd/kubeadm/app/constants/constants.go#L481-L486

@pacoxu
Member

pacoxu commented Apr 2, 2024

@SataQiu
Member

SataQiu commented Apr 2, 2024

+1 for applying the new livez/readyz probe in v1.31

> do we expect any behavior changes if we add a readiness probe for etcd in kubeadm?
> currently it indeed doesn't have a readiness probe.

That seems safe, because we already check the etcd cluster health via the client and have adopted a retry mechanism.

@ahrtr
Member

ahrtr commented Apr 2, 2024

> Q: is there a reason etcd released this feature in a patch release? perhaps it was considered a bugfix.

I raised the same question/comment when I was reviewing the livez/readyz design doc. Actually I wasn't a big fan of backporting it to stable releases, but I was not strongly against it, for the following reasons:

- It's backward compatible; it doesn't change the existing /health endpoint.
- It isn't a core part of etcdserver.
- It's strongly desired by some people in the community, although no strong justification was provided.

> so for kubeadm 1.31 we can switch to the new probe method.
> it might make more sense to switch in the kubeadm MINOR version that starts using 3.6 so that our users will know a bigger change happened, but perhaps this is not needed.
>
> generally more opinions on this topic are appreciated.

I believe there isn't any behaviour change in etcd's 3.5 and 3.4 patch releases in terms of the new /livez and /readyz probes. Specifically, /livez is equivalent to /health?exclude=NOSPACE&serializable=true, and /readyz is equivalent to /health?serializable=false.

We may extend both /livez and /readyz in main (3.6.0), but as mentioned above, there shouldn't be any behaviour or breaking change in 3.5 and 3.4.

It's totally up to kubeadm / cluster-life-cycle's maintainers / leads to decide when to use the new /livez and /readyz endpoints. But please do not expect we can get etcd 3.6.0 released soon.

> Since we backport etcd minor version to previous releases as well, it's quite safe to change the probe in v1.31.

I think they are orthogonal changes.

> BTW, do we plan to add the ReadinessProbe?

We can reuse /readyz for the startupProbe. If you or anyone insists on adding a dedicated endpoint for it, let's discuss it under etcd-io/etcd#16007.

@neolit123
Member

> It's totally up to kubeadm / cluster-life-cycle's maintainers / leads to decide when to use the new /livez and /readyz endpoints. But please do not expect we can get etcd 3.6.0 released soon.

ok, seems like this can be added in kubeadm 1.31 once code freeze for 1.30 is over.
thanks for the info

@neolit123
Member

neolit123 commented Apr 23, 2024

@siyuanfoundation @ahrtr any notable regressions in >= 3.5.11 that may keep users on < 3.5.11?
also i assume we really do recommend users move to the latest PATCH of 3.5 due to CVE fixes, correct?

i have a WIP PR for this here, but it does not do version checking, and it will cause an error for kubeadm users that are deploying a custom etcd version < 3.5.11 on kubeadm 1.31. i guess etcd will just fail to start.

to be on the safe side the PR could also include some version checking, but that would not work for custom image strings, as some users pass a SHA for the etcd image. it would only work for users that are using a semver in the etcd image tag.

i think my preference would be to just have this mentioned in the release notes, and also push kubeadm users stuck on an older custom etcd version to finally upgrade!

kubernetes/kubernetes#124465
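The SHA-vs-semver caveat above can be sketched as follows (a hedged illustration with hypothetical names, not kubeadm's actual code): images pinned by digest carry no parseable version, so any version check has to fall back to a default for them.

```python
import re

# Hypothetical sketch of version gating for the probe switch.
# An image tag like "registry.k8s.io/etcd:3.5.12-0" can be parsed as
# semver; a digest reference like "...etcd@sha256:..." cannot.

SEMVER_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)")

def supports_livez(image, assume_when_unknown=True):
    """Return True if the image tag parses as etcd >= 3.5.11.

    Digest-pinned images carry no version info, so we fall back to
    `assume_when_unknown` for them.
    """
    # A digest reference has no usable tag; skip parsing entirely.
    tag = "" if "@" in image else image.rsplit(":", 1)[-1]
    m = SEMVER_RE.match(tag)
    if not m:
        return assume_when_unknown
    return tuple(int(g) for g in m.groups()) >= (3, 5, 11)
```

With `assume_when_unknown=True` this matches the preference stated above: digest-pinned custom images get the new endpoints, and the release note carries the warning for users still on < 3.5.11.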

@neolit123 neolit123 self-assigned this Apr 23, 2024
@siyuanfoundation
Author

@neolit123 Thanks for working on this!
I don't think there are any notable regressions in >= 3.5.11.
I like the idea of having this mentioned in the release notes, and making kubeadm users stuck on an older custom etcd version finally upgrade!

@ahrtr
Member

ahrtr commented Apr 23, 2024

> @siyuanfoundation @ahrtr any notable regressions in >= 3.5.11 that may keep users on < 3.5.11?
> also i assume we really do recommend users move to the latest PATCH of 3.5 due to CVE fixes, correct?

No regression in >= 3.5.11.

Yes, it's recommended to bump etcd to the latest patch.
