Some kube-state-metrics shards are serving up stale metrics #2372

Open

schahal opened this issue Apr 16, 2024 · 5 comments
Labels: kind/bug, triage/accepted

Comments


schahal commented Apr 16, 2024

What happened:

We found some kube-state-metrics shards are serving up stale metrics.

For example, this pod is running and healthy:

$ kubectl get pods provider-kubernetes-a3cbbe355fa7-6d9d468f59-xbfsq
NAME                                                READY   STATUS    RESTARTS   AGE
provider-kubernetes-a3cbbe355fa7-6d9d468f59-xbfsq   1/1     Running   0          87m

However, for the past hour we've seen kube_pod_container_status_waiting_reason reporting it as ContainerCreating:

[Screenshot, 2024-04-16 2:00 PM: kube_pod_container_status_waiting_reason still reporting the pod as ContainerCreating]

And to prove this is being served by KSM, we looked at the incriminating shard's (kube-state-metrics-5) /metrics endpoint and saw this metric is definitely stale:

kube_pod_container_status_waiting_reason{namespace="<redacted>",pod="provider-kubernetes-a3cbbe355fa7-678fd88bc5-76dw4",uid="<redacted>",container="package-runtime",reason="ContainerCreating"} 1

This is one such example; there seem to be several similar situations.
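
For anyone wanting to reproduce this check, here is a rough sketch of how a single shard's /metrics endpoint can be inspected (assuming the shard pod lives in the kube-state-metrics namespace and serves metrics on the default port 8080):

$ # assumption: default metrics port 8080, namespace kube-state-metrics
$ kubectl port-forward -n kube-state-metrics pod/kube-state-metrics-5 8080:8080 &
$ curl -s http://localhost:8080/metrics | grep kube_pod_container_status_waiting_reason

Any series still reporting reason="ContainerCreating" for a pod that has long been Running is one of the stale entries described above.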

What you expected to happen:

The expectation is that the metric(s) match reality.

How to reproduce it (as minimally and precisely as possible):

Unfortunately, we're not quite sure when/why it gets into this state (anecdotally, it almost always happens when we upgrade KSM, though today there was no update besides some Prometheus agents)

We can mitigate the issue by restarting all the KSM shards... e.g.,

$ kubectl rollout restart -n kube-state-metrics statefulset kube-state-metrics

... if that's any clue for determining the root cause.
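
Presumably deleting only the affected shard's pod would also rebuild its cache, since the StatefulSet recreates it, but we've only verified the full rollout restart above:

$ # assumption: restarting a single shard clears its stale cache (unverified)
$ kubectl delete pod -n kube-state-metrics kube-state-metrics-5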

Anything else we need to know?:

  1. When I originally ran into the problem, I thought it had something to do with the Compatibility Matrix. But starting with KSM v2.11.0, I confirmed the client libraries are updated for my version of k8s (v1.28); see the version cross-check sketch after this list.

  2. There's nothing out of the ordinary in the KSM logs:

kube-state-metrics-5 logs:
I0409 08:17:51.349017       1 wrapper.go:120] "Starting kube-state-metrics"
W0409 08:17:51.349231       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0409 08:17:51.350019       1 server.go:199] "Used resources" resources=["limitranges","storageclasses","deployments","resourcequotas","statefulsets","cronjobs","endpoints","ingresses","namespaces","nodes","poddisruptionbudgets","mutatingwebhookconfigurations","replicasets","horizontalpodautoscalers","networkpolicies","validatingwebhookconfigurations","volumeattachments","daemonsets","jobs","services","certificatesigningrequests","configmaps","persistentvolumeclaims","replicationcontrollers","secrets","persistentvolumes","pods"]
I0409 08:17:51.350206       1 types.go:227] "Using all namespaces"
I0409 08:17:51.350225       1 types.go:145] "Using node type is nil"
I0409 08:17:51.350241       1 server.go:226] "Metric allow-denylisting" allowDenyStatus="Excluding the following lists that were on denylist: "
W0409 08:17:51.350258       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0409 08:17:51.350658       1 utils.go:70] "Tested communication with server"
I0409 08:17:52.420690       1 utils.go:75] "Run with Kubernetes cluster version" major="1" minor="28+" gitVersion="v1.28.6-eks-508b6b3" gitTreeState="clean" gitCommit="25a726351cee8ee6facce01af4214605e089d5da" platform="linux/amd64"
I0409 08:17:52.420837       1 utils.go:76] "Communication with server successful"
I0409 08:17:52.422588       1 server.go:350] "Started metrics server" metricsServerAddress="[::]:8080"
I0409 08:17:52.422595       1 server.go:339] "Started kube-state-metrics self metrics server" telemetryAddress="[::]:8081"
I0409 08:17:52.423030       1 server.go:73] level=info msg="Listening on" address=[::]:8080
I0409 08:17:52.423052       1 server.go:73] level=info msg="TLS is disabled." http2=false address=[::]:8080
I0409 08:17:52.423075       1 server.go:73] level=info msg="Listening on" address=[::]:8081
I0409 08:17:52.423093       1 server.go:73] level=info msg="TLS is disabled." http2=false address=[::]:8081
I0409 08:17:55.422262       1 config.go:84] "Using custom resource plural" resource="autoscaling.k8s.io_v1_VerticalPodAutoscaler" plural="verticalpodautoscalers"
I0409 08:17:55.422479       1 discovery.go:274] "discovery finished, cache updated"
I0409 08:17:55.422544       1 metrics_handler.go:106] "Autosharding enabled with pod" pod="kube-state-metrics/kube-state-metrics-5"
I0409 08:17:55.422573       1 metrics_handler.go:107] "Auto detecting sharding settings"
I0409 08:17:55.430380       1 metrics_handler.go:82] "Configuring sharding of this instance to be shard index (zero-indexed) out of total shards" shard=5 totalShards=16
I0409 08:17:55.431104       1 custom_resource_metrics.go:79] "Custom resource state added metrics" familyNames=["kube_customresource_vpa_containerrecommendations_target","kube_customresource_vpa_containerrecommendations_target"]
I0409 08:17:55.431143       1 builder.go:282] "Active resources" activeStoreNames="certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments,autoscaling.k8s.io/v1, Resource=verticalpodautoscalers"
I0416 16:47:01.423216       1 config.go:84] "Using custom resource plural" resource="autoscaling.k8s.io_v1_VerticalPodAutoscaler" plural="verticalpodautoscalers"
I0416 16:47:01.423283       1 config.go:209] "reloaded factory" GVR="autoscaling.k8s.io/v1, Resource=verticalpodautoscalers"
I0416 16:47:01.423466       1 builder.go:208] "Updating store" GVR="autoscaling.k8s.io/v1, Resource=verticalpodautoscalers"
I0416 16:47:01.423499       1 discovery.go:274] "discovery finished, cache updated"
I0416 16:47:01.423527       1 metrics_handler.go:106] "Autosharding enabled with pod" pod="kube-state-metrics/kube-state-metrics-5"
I0416 16:47:01.423545       1 metrics_handler.go:107] "Auto detecting sharding settings"
  3. This may be related to #2355 (kube-state-metrics with autosharding stops updating shards when the labels of the statefulset are updated), but I'm not sure enough about the linked PR to decide conclusively.
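
For the record, the version comparison from item 1 can be cross-checked roughly like this (a sketch, not the exact commands we ran; it assumes jq is installed and that kube-state-metrics is the first container in the pod template):

$ # assumption: KSM is the first container in the StatefulSet's pod template
$ kubectl get statefulset -n kube-state-metrics kube-state-metrics -o jsonpath='{.spec.template.spec.containers[0].image}'
$ kubectl version -o json | jq -r '.serverVersion.gitVersion'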

Environment:

  • kube-state-metrics version: v2.12.0 (this has occurred in previous versions too)
  • Kubernetes version (use kubectl version): v1.28.6
  • Cloud provider or hardware configuration: EKS
  • Other info:
schahal added the kind/bug label on Apr 16, 2024
k8s-ci-robot added the needs-triage label on Apr 16, 2024
CatherineF-dev (Contributor) commented

qq: have your Statefulset labels been changed?


schahal commented Apr 18, 2024

have your Statefulset labels been changed?

For this particular case, we don't suspect they changed (though we drop the metric that would let us confirm this 100%).

But in the other cases where we run into this issue, the labels almost always do change, particularly the chart version label when we upgrade:

Labels:             app.kubernetes.io/component=metrics
                    app.kubernetes.io/instance=kube-state-metrics
                    app.kubernetes.io/managed-by=Helm
                    app.kubernetes.io/name=kube-state-metrics
                    app.kubernetes.io/part-of=kube-state-metrics
                    app.kubernetes.io/version=2.12.0
                    helm.sh/chart=kube-state-metrics-5.18.1
                    release=kube-state-metrics
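
For completeness, those labels can be pulled straight from the StatefulSet (assuming it is named kube-state-metrics in the kube-state-metrics namespace), which makes it easy to diff them before and after a chart upgrade:

$ # assumption: StatefulSet name and namespace are both kube-state-metrics
$ kubectl get statefulset -n kube-state-metrics kube-state-metrics --show-labels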

logicalhan (Member) commented

/assign @CatherineF-dev
/triage accepted

k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Apr 18, 2024
CatherineF-dev (Contributor) commented

But for other cases that we run into this issue, almost always the labels get changed, particularly the chart version when we upgrade:

This is related to #2347

For this particular case, we don't suspect they'd changed (tho we drop the metric to confirm this 100%).

This is a new issue.

LaikaN57 mentioned this issue on May 1, 2024

schahal commented May 6, 2024

This is related to #2347

For the purposes of this issue, I think it's wholly related to #2347 (the one time we claimed the statefulset may not have changed labels, we had no proof of that).

IMO, we can track this issue against that PR for closure (and if we do see another case of stale metrics, we can gather the exact circumstances in a separate issue, if needed).
