Local queues prometheus metrics #1833

astefanutti · 2024-03-13T13:25:38Z

What would you like to be added:

Expose Prometheus metrics for local queues, equivalent to the existing cluster queue metrics, but filtered and labeled by local queues.

Similar to the visibility API, which serves information about pending workloads in local queues, it would be possible to get metrics such as pending workloads, admitted active workloads, and resource usage for local queues.

If cardinality is a concern, those metrics could be exposed behind a feature flag.
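
To make the cardinality concern concrete, here is a rough back-of-the-envelope sketch in Python. The label names (`namespace`, `name`, `status`), the status values, and the cluster size are assumptions chosen for illustration, not Kueue's actual metric schema. It counts how many time series a single status-style gauge would produce if it were labeled per local queue.

```python
from itertools import product


def series_count(local_queues, statuses):
    """Count label combinations for one gauge labeled by (namespace, name, status)."""
    # Each (namespace, name) pair is one local queue; every status value
    # becomes its own series when the gauge is reported once per status.
    return sum(1 for _ in product(local_queues, statuses))


# Hypothetical cluster: 50 namespaces with 4 local queues each.
local_queues = [(f"team-{n}", f"queue-{q}") for n in range(50) for q in range(4)]
statuses = ("pending", "active", "orphaned")  # assumed status values

print(series_count(local_queues, statuses))  # 600
```

The count grows with the number of namespaces rather than with the (usually much smaller) number of cluster queues, which is why an opt-in switch is attractive.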

Why is this needed:

Metrics about local queues can be useful for the batch user persona, giving them visibility into their workloads and their historical trends.

While some metrics are already available for cluster queues, exposing them to the batch user persona presents the following challenges / limitations:

Cluster queue metrics are global and cannot be filtered by namespace / tenant
Querying "cluster-scoped" metrics in secured Prometheus instances is generally only authorised for cluster admin users who have access to all namespaces / tenants

Completion requirements:

This enhancement requires the following artifacts:

Design doc
API change
Docs update

The artifacts should be linked in subsequent comments.

astefanutti · 2024-03-13T13:30:29Z

@alculquicondor @tenzen-y Do you think that'd be a useful / possible enhancement?

alculquicondor · 2024-03-13T13:58:54Z

> If cardinality is a concern, those metrics could be exposed behind a feature flag.

Yes, that's the primary concern. I wouldn't make it a feature flag, but a long-term configuration field.

tenzen-y · 2024-03-13T14:19:23Z

> @alculquicondor @tenzen-y Do you think that'd be a useful / possible enhancement?

I understand that this feature would be very useful, but I share @alculquicondor's concern.
IIRC, we had similar discussions previously when we designed Visibility / Visibility On-demand.
So, if we introduce this feature, it would be better to make it configurable through the Config API.

Anyway, I guess that having a small KEP would be better since we may extend the existing Config API.

astefanutti · 2024-03-13T15:11:56Z

That makes sense. There could be one or two options added to the Config API, similar to the existing .metrics.enableClusterQueueResources, like .metrics.enableLocalQueues and .metrics.enableLocalQueueResources.
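
As a minimal sketch of how such options might gate metric registration, the Python below models the `metrics` section of a parsed configuration. `enableClusterQueueResources` is the existing option named above; `enableLocalQueues` and `enableLocalQueueResources` are only the names proposed in this comment. The group names and the rule that local queue resource metrics depend on `enableLocalQueues` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MetricsConfig:
    # Existing option mentioned in the thread.
    enable_cluster_queue_resources: bool = False
    # Proposed options from the thread; the names are not final.
    enable_local_queues: bool = False
    enable_local_queue_resources: bool = False


def parse_metrics_config(raw: dict) -> MetricsConfig:
    """Build a MetricsConfig from the `metrics` section of a parsed config file."""
    metrics = raw.get("metrics", {})
    return MetricsConfig(
        enable_cluster_queue_resources=bool(metrics.get("enableClusterQueueResources", False)),
        enable_local_queues=bool(metrics.get("enableLocalQueues", False)),
        enable_local_queue_resources=bool(metrics.get("enableLocalQueueResources", False)),
    )


def enabled_metric_groups(cfg: MetricsConfig) -> List[str]:
    """Return the (hypothetical) metric groups that would be registered."""
    groups = ["cluster_queue"]  # assumed to be always on
    if cfg.enable_cluster_queue_resources:
        groups.append("cluster_queue_resources")
    if cfg.enable_local_queues:
        groups.append("local_queue")
        # Assumption: resource metrics for local queues only make sense
        # when local queue metrics are enabled at all.
        if cfg.enable_local_queue_resources:
            groups.append("local_queue_resources")
    return groups


cfg = parse_metrics_config({"metrics": {"enableLocalQueues": True}})
print(enabled_metric_groups(cfg))  # ['cluster_queue', 'local_queue']
```

Defaulting every option to off keeps the extra series opt-in, which matches the cardinality concern raised earlier.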

I can work on a small KEP if you guys give the green light.

alculquicondor · 2024-03-13T15:44:03Z

It seems simple enough.

tenzen-y · 2024-03-13T15:49:26Z

SGTM

astefanutti · 2024-03-13T15:55:25Z

Thanks for your quick feedback! I'll work on it asap.

/assign

k8s-triage-robot · 2024-06-11T16:51:36Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

alculquicondor · 2024-06-24T17:02:53Z

@astefanutti are you still looking into this?

astefanutti · 2024-06-25T06:51:32Z

@alculquicondor I haven't, but hopefully we'll get back to it soon.

astefanutti · 2024-06-25T06:51:42Z

/remove-lifecycle stale

astefanutti · 2024-06-25T06:51:55Z

/unassign

This PR introduces an enhancement to enable collection of prometheus metrics for local queues.
Addresses issue: kubernetes-sigs#1833
Signed-off-by: Varsha Prasad Narsing

alculquicondor · 2024-07-02T17:40:00Z

@varshaprasad96, please write /assign in a comment to claim this issue. It's important to communicate that you are working on an issue so that other contributors don't try to work on the same thing.

varshaprasad96 · 2024-07-10T07:13:42Z

/assign

* [Feature] Enable prometheus metrics for local queues
  This PR introduces an enhancement to enable collection of prometheus metrics for local queues.
  Addresses issue: #1833
  Signed-off-by: Varsha Prasad Narsing
* Address reviews
  This commit addresses reviews by adding additional metrics for local queue.
  Signed-off-by: Varsha Prasad Narsing

k8s-triage-robot · 2024-10-08T07:57:23Z

/lifecycle stale

tenzen-y · 2024-10-08T16:52:50Z

/remove-lifecycle stale

@varshaprasad96 Are you still working on this enhancement?

varshaprasad96 · 2024-10-08T16:54:54Z

@tenzen-y Yes. I'm planning to get the implementation PR up in the next few days.

tenzen-y · 2024-10-08T16:56:04Z

> @tenzen-y Yes. I'm planning to get the implementation PR up in the next few days.

Awesome, thanks for your effort!

KPostOffice · 2024-11-18T16:52:34Z

/assign

KPostOffice · 2024-11-18T17:22:27Z

I've been working on this KEP for the last couple of days, and overall it seemed pretty straightforward until I got to adding the LocalQueueByStatus metric. AFAICT, the status metric for CQs is a bubbling up of the internal cq.status. The only representation of LQ states exists inside the CQ struct, and I felt like adding all LQs to the cache struct felt wrong.

My current thought is to add a new type in metrics, LocalQueueStatus, with the values active, pending, and orphaned, where the LQ inherits its status from its parent CQ. If the parent is terminating, the LQ status moves to pending, and if the CQ is deleted, the status moves to orphaned. When reconciling the LQ to update its status, I can just grab the status directly from the CQ in the cache.
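
The inheritance rule described above could be sketched roughly as follows. This is an illustration in Python, not the Kueue implementation: the `ClusterQueueStatus` values (pending, active, terminating) are an assumption about how CQ status is represented, and `None` stands in for a deleted CQ.

```python
from enum import Enum
from typing import Optional


class ClusterQueueStatus(str, Enum):
    # Assumed CQ status values; not confirmed by this thread.
    PENDING = "pending"
    ACTIVE = "active"
    TERMINATING = "terminating"


class LocalQueueStatus(str, Enum):
    # The three values proposed in the comment above.
    ACTIVE = "active"
    PENDING = "pending"
    ORPHANED = "orphaned"


def local_queue_status(cq_status: Optional[ClusterQueueStatus]) -> LocalQueueStatus:
    """Derive an LQ status from its parent CQ status (None means the CQ is gone)."""
    if cq_status is None:
        return LocalQueueStatus.ORPHANED
    if cq_status is ClusterQueueStatus.ACTIVE:
        return LocalQueueStatus.ACTIVE
    # A pending or a terminating parent both map to a pending LQ.
    return LocalQueueStatus.PENDING


print(local_queue_status(ClusterQueueStatus.TERMINATING).value)  # pending
```

Because the LQ status is a pure function of the parent's status, nothing about the LQ itself needs to be stored to compute it.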

KPostOffice · 2024-11-18T17:23:12Z

cc @mimowo @PBundyra @tenzen-y

varshaprasad96 · 2024-11-19T14:29:52Z

@KPostOffice Could you elaborate on where exactly in localqueue_controller you are trying to update the metrics? IIUC, the local queue reconciler watches the CQ, and any status updates would be reflected here.

> I felt like adding all LQs to the cache struct felt wrong

Also, the local_queue has a reference to the respective cluster queue. Can't we just query the CQ's status in the cache directly while reporting metrics?

KPostOffice · 2024-11-19T15:52:58Z

@varshaprasad96 I wasn't planning on adding the metrics to the localqueue_controller. I was adding them to either manager.go or the cache so that the metrics reflect Kueue's operational state; this is what is done for the CQ status metrics, see here. I'm having trouble figuring out exactly how to represent the LQ's status. I think it can just update when either:

an LQ is added
an LQ is updated
the LQ's underlying CQ status is updated
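
The three triggers above can be sketched with a small in-memory tracker. Everything here is hypothetical: the class and method names, the label layout `(namespace, name, status)`, and the one-hot gauge convention (1 for the current status, 0 for the others) are assumptions for illustration. It also keeps a small LQ-to-CQ index purely to make the triggers visible, which is the kind of bookkeeping the discussion above is hesitant about.

```python
from typing import Dict, Optional, Tuple

LQ_STATUSES = ("active", "pending", "orphaned")  # assumed label values


class LocalQueueStatusTracker:
    def __init__(self) -> None:
        self.cq_status: Dict[str, str] = {}               # CQ name -> status
        self.lq_parent: Dict[Tuple[str, str], str] = {}   # (namespace, name) -> CQ name
        self.gauge: Dict[Tuple[str, str, str], int] = {}  # (namespace, name, status) -> 0 or 1

    def _report(self, lq: Tuple[str, str]) -> None:
        """Recompute the one-hot gauge values for a single LQ."""
        cq = self.cq_status.get(self.lq_parent[lq])
        if cq is None:
            current = "orphaned"  # parent CQ is unknown or deleted
        elif cq == "active":
            current = "active"
        else:
            current = "pending"   # e.g. parent is pending or terminating
        for status in LQ_STATUSES:
            self.gauge[(lq[0], lq[1], status)] = 1 if status == current else 0

    def upsert_local_queue(self, namespace: str, name: str, cluster_queue: str) -> None:
        """Triggers 1 and 2: an LQ is added or updated."""
        lq = (namespace, name)
        self.lq_parent[lq] = cluster_queue
        self._report(lq)

    def set_cluster_queue_status(self, cluster_queue: str, status: Optional[str]) -> None:
        """Trigger 3: the underlying CQ status changes (None means deleted)."""
        if status is None:
            self.cq_status.pop(cluster_queue, None)
        else:
            self.cq_status[cluster_queue] = status
        for lq, parent in self.lq_parent.items():
            if parent == cluster_queue:
                self._report(lq)


tracker = LocalQueueStatusTracker()
tracker.set_cluster_queue_status("cq-a", "active")
tracker.upsert_local_queue("team-1", "lq-1", "cq-a")
tracker.set_cluster_queue_status("cq-a", "terminating")
print(tracker.gauge[("team-1", "lq-1", "pending")])  # 1
```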

astefanutti added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 13, 2024

k8s-ci-robot assigned astefanutti Mar 13, 2024

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 11, 2024

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 25, 2024

k8s-ci-robot unassigned astefanutti Jun 25, 2024

varshaprasad96 mentioned this issue Jul 2, 2024

[Feature] Enable prometheus metrics for local queues #2516

Merged

k8s-ci-robot assigned varshaprasad96 Jul 10, 2024

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 8, 2024

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 8, 2024

k8s-ci-robot assigned KPostOffice Nov 18, 2024

KPostOffice linked a pull request Nov 21, 2024 that will close this issue

Implement/local metrics #3609

Open
