Purpose of `kube_pod_status_ready{condition="unknown"}` #2380

sidewinder12s · 2024-04-22T18:40:35Z

What happened:

What is the purpose of the unknown condition on kube_pod_status_ready metric?

kube_pod_status_ready{condition="unknown"} == 1

Searching through my metrics for a fairly high number of pods and nodes, I don't see that status ever being true.

So my initial impression is this is needlessly wasting cardinality in a way that is very hard to avoid.

Anything else we need to know?:

Environment:

kube-state-metrics version: 2.7
Kubernetes version (use kubectl version): 1.28
Cloud provider or hardware configuration: EKS
Other info:

The text was updated successfully, but these errors were encountered:

ricardoapl · 2024-04-23T16:09:49Z

As far as I can tell that seems to (still) be a valid value for a PodCondition. You can find (a little bit) more about it in Pod conditions.

sidewinder12s · 2024-04-23T16:32:14Z

I was suspecting that, just kind of surprised I hadn't seen that status actually in any of our metrics as we get containers in that state fairly regularly.

It's unfortunate KSM exposes these like this, the cardinality caused by it is extreme.

ricardoapl · 2024-04-23T17:02:51Z

It's unfortunate KSM exposes these like this, the cardinality caused by it is extreme.

On the one hand, I don't think it would be wise to omit such label value given that it seems to still be a valid status. On the other hand, I can see how this might not be of interest and could impact cardinality. I wonder if we could have some kind of metric cardinality enforcement mechanism like we have for kube-apiserver, though I'm not sure if it would be too much just for this one metric/label. There are probably better ways to solve this, though I can't think of anything right now.

I was suspecting that, just kind of surprised I hadn't seen that status actually in any of our metrics as we get containers in that state fairly regularly.

You mean you often have containers with unknown Ready condition status, yet they don't show up as such in your metrics?

sidewinder12s · 2024-04-23T17:28:17Z

KSM has a lot of the status metrics that are similar to this one where it has many label + value combinations for each pod that can explode cardinality and cost, especially if you are using a metrics vendor which bills by the cardinality/active series. So I think some broader cardinality enforcement might be broadly useful.

As far as Ready Status, I might be mixing ContainerStatusUnknown that I see on pods when I run kubectl get pods for example, with Ready status which indeed might be different.

jwilder · 2024-04-30T05:54:37Z

The metrics that have conditions are some are our most problematic ones as well. I've been thinking of two ways to address them:

For the True/False/Unknown cases like kube_pod_status_ready, return a value -1=Unknown, 0=False, 1=True and remove the "condition" label.
Only emit the labels/series for the True combination and the condition is actually True. This would help us for metrics like kube_node_status_condition and kube_pod_status_phase where we have many nodes * conditions/pods * phases, but vast majority are always 0 and we only really care about conditions that are True.

ricardoapl · 2024-04-30T15:11:33Z

For the True/False/Unknown cases like kube_pod_status_ready, return a value -1=Unknown, 0=False, 1=True and remove the "condition" label.

Some of these metrics are marked as stable, so removing a label might not be something we would want to do. However, I can't tell if kube-state-metrics really follows the Kubernetes metrics stability framework.

jwilder · 2024-04-30T16:01:03Z

Ah. Another option might be:

Leave existing metrics as is for stability, create new ones in the right shape and provide an option to disable the old ones if a user chooses. There is the metric allow/deny list that could be used to not expose the stable metrics. Would be nice to not even create them in memory just to drop them though.

sidewinder12s · 2024-04-30T18:39:07Z

For 2, one of our metrics vendors supports dropping data based on metric value, which we've implemented for many of our high cardinality metrics that are mostly zeros.

So far, many of them tend to fall into 2 camps:

An error/count metric that most of the time is zero and any increase indicates a problem
i. network errors/drops being a good example
A state metric, KSM being a huge user of it that indicates binary state through the metric value (and potentially in combination with label value changes)
i. All the status metrics

When we were adding these drops, the main drawback was that any gauges/math might not work without the zero value, but can be worked around.

For 3, I had wondered if anyone had proposed/questioned how overloaded some of these status metrics become with the # of labels/label combinations that end up being created. I don't really think there'd be huge downside to breaking these up into individual metrics vs the current solution.

Would another option be to have a runtime flag to just not emit the metrics if they are zero?

jwilder · 2024-04-30T23:11:38Z

Dropping zeros would be very useful. If there was an option to enable for KSM, that would cut out about ~55% of the series that we don't use.

I don't really think there'd be huge downside to breaking these up into individual metrics vs the current solution.

Agree. A metric like kube_pod_status_ready has the same information encoded twice. One series with condition=true and value 1, is the same as condition=false and value 0.

kube_pod_status_phase could be broken out to kube_pod_status_phase_pending, kube_pod_status_phase_succeeded, kube_pod_status_phase_failed and kube_pod_status_phase_running. With ability to drop 0 values, that would also cut out a lot excess data. This structure would also work better with the denylist to be more fine grained in what metrics are kept.

This adds a new flage `--metric-drop-value` that allows metrics with a value matching the flag to not be emitted. This is useful for reducing cardinality of metrics where many of the values are 0. In larger clusters, the size of the returned metrics can be quite large which can be costly to process and store in some environments. The plumbing adds support for filtering series by label name, label value or value, but this PR only sets up filtering by value. This idea for this feature was discussed in kubernetes#2380. Related kubernetes#2116

sidewinder12s · 2024-05-01T16:54:24Z

I do wonder if there is some use case to this structure we might be missing. But at scale the way these metrics are laid out is egregious.

This adds a new flage `--metric-drop-value` that allows metrics with a value matching the flag to not be emitted. This is useful for reducing cardinality of metrics where many of the values are 0. In larger clusters, the size of the returned metrics can be quite large which can be costly to process and store in some environments. The plumbing adds support for filtering series by label name, label value or value, but this PR only sets up filtering by value. This idea for this feature was discussed in kubernetes#2380. Related kubernetes#2116

logicalhan · 2024-05-02T16:50:50Z

/triage accepted
/kind support

logicalhan · 2024-05-02T16:51:23Z

/remove-kind bug

logicalhan · 2024-05-02T16:51:38Z

/assign @dgrisonnet

This adds a new flag `--metric-keep-true` that allows metrics with series indicating false/0 value to be dropped. This is useful for reducing cardinality of metrics where many of the values are 0. In larger clusters, the size of the returned metrics can be quite large which can be costly to process and store in some environments. The plumbing adds support for filtering series by label name, label value or value, but this PR only sets up filtering specifically for metrics with "condition" labels and various state metrics (e.g. kube_pod_status_phase). This idea for this feature was discussed in kubernetes#2380. Related kubernetes#2116

sidewinder12s added the kind/bug Categorizes issue or PR as related to a bug. label Apr 22, 2024

k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 22, 2024

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. kind/support Categorizes issue or PR as a support question. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 2, 2024

k8s-ci-robot removed the kind/bug Categorizes issue or PR as related to a bug. label May 2, 2024

k8s-ci-robot assigned dgrisonnet May 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Purpose of `kube_pod_status_ready{condition="unknown"}` #2380

Purpose of `kube_pod_status_ready{condition="unknown"}` #2380

sidewinder12s commented Apr 22, 2024

ricardoapl commented Apr 23, 2024

sidewinder12s commented Apr 23, 2024

ricardoapl commented Apr 23, 2024

sidewinder12s commented Apr 23, 2024

jwilder commented Apr 30, 2024

ricardoapl commented Apr 30, 2024

jwilder commented Apr 30, 2024

sidewinder12s commented Apr 30, 2024

jwilder commented Apr 30, 2024

sidewinder12s commented May 1, 2024

logicalhan commented May 2, 2024

logicalhan commented May 2, 2024

logicalhan commented May 2, 2024

Purpose of kube_pod_status_ready{condition="unknown"} #2380

Purpose of kube_pod_status_ready{condition="unknown"} #2380

Comments

sidewinder12s commented Apr 22, 2024

ricardoapl commented Apr 23, 2024

sidewinder12s commented Apr 23, 2024

ricardoapl commented Apr 23, 2024

sidewinder12s commented Apr 23, 2024

jwilder commented Apr 30, 2024

ricardoapl commented Apr 30, 2024

jwilder commented Apr 30, 2024

sidewinder12s commented Apr 30, 2024

jwilder commented Apr 30, 2024

sidewinder12s commented May 1, 2024

logicalhan commented May 2, 2024

logicalhan commented May 2, 2024

logicalhan commented May 2, 2024

Purpose of `kube_pod_status_ready{condition="unknown"}` #2380

Purpose of `kube_pod_status_ready{condition="unknown"}` #2380