Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Purpose of kube_pod_status_ready{condition="unknown"} #2380

Open
sidewinder12s opened this issue Apr 22, 2024 · 13 comments
Open

Purpose of kube_pod_status_ready{condition="unknown"} #2380

sidewinder12s opened this issue Apr 22, 2024 · 13 comments
Assignees
Labels
kind/support Categorizes issue or PR as a support question. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@sidewinder12s
Copy link

What happened:

What is the purpose of the unknown condition on kube_pod_status_ready metric?

kube_pod_status_ready{condition="unknown"} == 1

Searching through my metrics for a fairly high number of pods and nodes, I don't see that status ever being true.

So my initial impression is this is needlessly wasting cardinality in a way that is very hard to avoid.

Anything else we need to know?:

Environment:

  • kube-state-metrics version: 2.7
  • Kubernetes version (use kubectl version): 1.28
  • Cloud provider or hardware configuration: EKS
  • Other info:
@sidewinder12s sidewinder12s added the kind/bug Categorizes issue or PR as related to a bug. label Apr 22, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 22, 2024
@ricardoapl
Copy link
Contributor

As far as I can tell that seems to (still) be a valid value for a PodCondition. You can find (a little bit) more about it in Pod conditions.

@sidewinder12s
Copy link
Author

I was suspecting that, just kind of surprised I hadn't seen that status actually in any of our metrics as we get containers in that state fairly regularly.

It's unfortunate KSM exposes these like this, the cardinality caused by it is extreme.

@ricardoapl
Copy link
Contributor

It's unfortunate KSM exposes these like this, the cardinality caused by it is extreme.

On the one hand, I don't think it would be wise to omit such label value given that it seems to still be a valid status. On the other hand, I can see how this might not be of interest and could impact cardinality. I wonder if we could have some kind of metric cardinality enforcement mechanism like we have for kube-apiserver, though I'm not sure if it would be too much just for this one metric/label. There are probably better ways to solve this, though I can't think of anything right now.

I was suspecting that, just kind of surprised I hadn't seen that status actually in any of our metrics as we get containers in that state fairly regularly.

You mean you often have containers with unknown Ready condition status, yet they don't show up as such in your metrics?

@sidewinder12s
Copy link
Author

KSM has a lot of the status metrics that are similar to this one where it has many label + value combinations for each pod that can explode cardinality and cost, especially if you are using a metrics vendor which bills by the cardinality/active series. So I think some broader cardinality enforcement might be broadly useful.

As far as Ready Status, I might be mixing ContainerStatusUnknown that I see on pods when I run kubectl get pods for example, with Ready status which indeed might be different.

@jwilder
Copy link
Contributor

jwilder commented Apr 30, 2024

The metrics that have conditions are some are our most problematic ones as well. I've been thinking of two ways to address them:

  1. For the True/False/Unknown cases like kube_pod_status_ready, return a value -1=Unknown, 0=False, 1=True and remove the "condition" label.
  2. Only emit the labels/series for the True combination and the condition is actually True. This would help us for metrics like kube_node_status_condition and kube_pod_status_phase where we have many nodes * conditions/pods * phases, but vast majority are always 0 and we only really care about conditions that are True.

@ricardoapl
Copy link
Contributor

For the True/False/Unknown cases like kube_pod_status_ready, return a value -1=Unknown, 0=False, 1=True and remove the "condition" label.

Some of these metrics are marked as stable, so removing a label might not be something we would want to do. However, I can't tell if kube-state-metrics really follows the Kubernetes metrics stability framework.

@jwilder
Copy link
Contributor

jwilder commented Apr 30, 2024

Ah. Another option might be:

  1. Leave existing metrics as is for stability, create new ones in the right shape and provide an option to disable the old ones if a user chooses. There is the metric allow/deny list that could be used to not expose the stable metrics. Would be nice to not even create them in memory just to drop them though.

@sidewinder12s
Copy link
Author

For 2, one of our metrics vendors supports dropping data based on metric value, which we've implemented for many of our high cardinality metrics that are mostly zeros.

So far, many of them tend to fall into 2 camps:

  1. An error/count metric that most of the time is zero and any increase indicates a problem
    i. network errors/drops being a good example
  2. A state metric, KSM being a huge user of it that indicates binary state through the metric value (and potentially in combination with label value changes)
    i. All the status metrics

When we were adding these drops, the main drawback was that any gauges/math might not work without the zero value, but can be worked around.

For 3, I had wondered if anyone had proposed/questioned how overloaded some of these status metrics become with the # of labels/label combinations that end up being created. I don't really think there'd be huge downside to breaking these up into individual metrics vs the current solution.

Would another option be to have a runtime flag to just not emit the metrics if they are zero?

@jwilder
Copy link
Contributor

jwilder commented Apr 30, 2024

Dropping zeros would be very useful. If there was an option to enable for KSM, that would cut out about ~55% of the series that we don't use.

I don't really think there'd be huge downside to breaking these up into individual metrics vs the current solution.

Agree. A metric like kube_pod_status_ready has the same information encoded twice. One series with condition=true and value 1, is the same as condition=false and value 0.

kube_pod_status_phase could be broken out to kube_pod_status_phase_pending, kube_pod_status_phase_succeeded, kube_pod_status_phase_failed and kube_pod_status_phase_running. With ability to drop 0 values, that would also cut out a lot excess data. This structure would also work better with the denylist to be more fine grained in what metrics are kept.

jwilder added a commit to jwilder/kube-state-metrics that referenced this issue May 1, 2024
This adds a new flage `--metric-drop-value` that allows metrics
with a value matching the flag to not be emitted.  This is useful
for reducing cardinality of metrics where many of the values are 0.
In larger clusters, the size of the returned metrics can be quite large
which can be costly to process and store in some environments.

The plumbing adds support for filtering series by label name, label value
or value, but this PR only sets up filtering by value.

This idea for this feature was discussed in kubernetes#2380.

Related kubernetes#2116
@sidewinder12s
Copy link
Author

I do wonder if there is some use case to this structure we might be missing. But at scale the way these metrics are laid out is egregious.

jwilder added a commit to jwilder/kube-state-metrics that referenced this issue May 1, 2024
This adds a new flage `--metric-drop-value` that allows metrics
with a value matching the flag to not be emitted.  This is useful
for reducing cardinality of metrics where many of the values are 0.
In larger clusters, the size of the returned metrics can be quite large
which can be costly to process and store in some environments.

The plumbing adds support for filtering series by label name, label value
or value, but this PR only sets up filtering by value.

This idea for this feature was discussed in kubernetes#2380.

Related kubernetes#2116
@logicalhan
Copy link
Member

/triage accepted
/kind support

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. kind/support Categorizes issue or PR as a support question. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 2, 2024
@logicalhan
Copy link
Member

/remove-kind bug

@k8s-ci-robot k8s-ci-robot removed the kind/bug Categorizes issue or PR as related to a bug. label May 2, 2024
@logicalhan
Copy link
Member

/assign @dgrisonnet

jwilder added a commit to jwilder/kube-state-metrics that referenced this issue May 3, 2024
This adds a new flag `--metric-keep-true` that allows metrics
with series indicating false/0 value to be dropped.  This is useful
for reducing cardinality of metrics where many of the values are 0.
In larger clusters, the size of the returned metrics can be quite large
which can be costly to process and store in some environments.

The plumbing adds support for filtering series by label name, label value
or value, but this PR only sets up filtering specifically for metrics
with "condition" labels and various state metrics (e.g. kube_pod_status_phase).

This idea for this feature was discussed in kubernetes#2380.

Related kubernetes#2116
jwilder added a commit to jwilder/kube-state-metrics that referenced this issue May 3, 2024
This adds a new flag `--metric-keep-true` that allows metrics
with series indicating false/0 value to be dropped.  This is useful
for reducing cardinality of metrics where many of the values are 0.
In larger clusters, the size of the returned metrics can be quite large
which can be costly to process and store in some environments.

The plumbing adds support for filtering series by label name, label value
or value, but this PR only sets up filtering specifically for metrics
with "condition" labels and various state metrics (e.g. kube_pod_status_phase).

This idea for this feature was discussed in kubernetes#2380.

Related kubernetes#2116
jwilder added a commit to jwilder/kube-state-metrics that referenced this issue May 3, 2024
This adds a new flag `--metric-keep-true` that allows metrics
with series indicating false/0 value to be dropped.  This is useful
for reducing cardinality of metrics where many of the values are 0.
In larger clusters, the size of the returned metrics can be quite large
which can be costly to process and store in some environments.

The plumbing adds support for filtering series by label name, label value
or value, but this PR only sets up filtering specifically for metrics
with "condition" labels and various state metrics (e.g. kube_pod_status_phase).

This idea for this feature was discussed in kubernetes#2380.

Related kubernetes#2116
jwilder added a commit to jwilder/kube-state-metrics that referenced this issue May 9, 2024
This adds a new flag `--metric-keep-true` that allows metrics
with series indicating false/0 value to be dropped.  This is useful
for reducing cardinality of metrics where many of the values are 0.
In larger clusters, the size of the returned metrics can be quite large
which can be costly to process and store in some environments.

The plumbing adds support for filtering series by label name, label value
or value, but this PR only sets up filtering specifically for metrics
with "condition" labels and various state metrics (e.g. kube_pod_status_phase).

This idea for this feature was discussed in kubernetes#2380.

Related kubernetes#2116
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/support Categorizes issue or PR as a support question. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

No branches or pull requests

6 participants