
Issue with Counter Metrics for Short-Lived Celery Workers in Kubernetes #336

Open
lancewl opened this issue Nov 27, 2024 · 7 comments

lancewl commented Nov 27, 2024

Hi there,

First, I want to express my gratitude for developing this amazing exporter. It has been incredibly helpful in monitoring our systems.

Issue Description:

I am encountering an issue with incorrect numbers appearing in my Grafana dashboards. After investigating, I discovered that the root cause is related to how the increase() and rate() functions in Prometheus handle new counter metrics. Prometheus treats the first sample of a new series as the baseline, so a counter that goes from non-existent straight to 1 contributes nothing to increase() or rate(). (For more details, please refer to prometheus/prometheus#1673.)

Context:

  • Environment: I am running Celery workers in a Kubernetes cluster with an autoscaler.
  • Challenge: The autoscaler frequently creates short-lived Celery workers, each with a unique hostname. As a result, I have many metric series with different hostnames, each starting at a counter value of 1 when the worker first receives a task.
  • Problem: Prometheus cannot calculate accurate rates over these new series because it has no way to tell a newly created counter from a stale one.

Attempts to Resolve:

I have tried various workarounds by modifying the PromQL expressions, but none have successfully addressed the issue.

Proposed Solution:

To improve accuracy, it might be beneficial for the exporter to send an initial value of 0 for all counter metrics when it detects new workers. This would help Prometheus correctly interpret the metrics as new rather than stale.
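
To illustrate, here is a minimal sketch of the idea using prometheus_client and Celery's event receiver. The metric name, label set, broker URL, and port are placeholders for illustration, not the exporter's actual internals:

```python
from celery import Celery
from prometheus_client import Counter, start_http_server

# Illustrative metric; the real exporter defines its own counters and labels.
tasks_succeeded = Counter(
    "celery_task_succeeded",
    "Number of succeeded tasks.",
    ["hostname"],
)

def handle_worker_online(event):
    # Touching .labels() creates the labeled child at 0, so Prometheus scrapes
    # an explicit 0 for the new worker before its first task increments it.
    tasks_succeeded.labels(hostname=event["hostname"])

def main():
    app = Celery(broker="redis://localhost:6379/0")  # placeholder broker URL
    start_http_server(9808)                          # placeholder exporter port
    with app.connection() as connection:
        receiver = app.events.Receiver(
            connection,
            handlers={
                "worker-online": handle_worker_online,
                # ...the exporter's existing task-event handlers go here...
            },
        )
        receiver.capture(limit=None, timeout=None)

if __name__ == "__main__":
    main()
```

With something like this in place, increase() and rate() see the 0 → 1 transition instead of a series that first appears at 1.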

Thank you for considering this enhancement. I believe it would greatly improve the accuracy of monitoring in dynamic environments like ours.

Best,
Lance

danihodovic (Owner) commented Nov 27, 2024

Thanks for the kind words. Would you like to open a pull-request?

What is your use-case for consistently scaling the number of workers up and down? Why not run a fixed set of workers?

lancewl (Author) commented Nov 27, 2024

Sure, I'll aim to open a PR next week when I have more bandwidth!

The primary reason for implementing an auto-scaler for workers is that our workload tends to peak at specific times throughout the day. Instead of running a large number of workers continuously to manage these peak periods, we can scale up only during those times. This approach optimizes resource usage and efficiency.

danihodovic (Owner) commented:

> The primary reason for implementing an auto-scaler for workers is that our workload tends to peak at specific times throughout the day. Instead of running a large number of workers continuously to manage these peak periods, we can scale up only during those times. This approach optimizes resource usage and efficiency.

How many workers are you running at min / max?

To avoid generating new hostnames on every scale-up, you can run Celery workers as StatefulSets rather than Deployments. That's what I do.

lancewl (Author) commented Nov 27, 2024

At the moment, I have a minimum of 4 and a maximum of 16. While a StatefulSet seems appropriate, I believe that scaling up and down might still lead to issues with workers that aren't running continuously. I think the proposed solution could be beneficial for other use cases that have static hostnames, as it addresses the problem of the first task received by a worker not being counted correctly.

danihodovic (Owner) commented:

> At the moment, I have a minimum of 4 and a maximum of 16.

👍 Do you use some kind of custom queue-based scaling in K8s? I'm interested in a solution myself for the project I'm working on.

lancewl (Author) commented Nov 27, 2024

For now, I'm simply scaling based on the CPU usage of the pods. I'm also exploring custom solutions to scale based on queue size; I'll probably deploy a simple Python script that reads the queue and publishes its length to Kubernetes for autoscaling.
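
For what it's worth, a rough sketch of such a script, assuming a Redis broker and the default "celery" queue (both placeholders for whatever your setup uses); the exposed gauge can then drive scaling through a custom-metrics pipeline:

```python
import time

import redis
from prometheus_client import Gauge, start_http_server

BROKER_URL = "redis://localhost:6379/0"  # placeholder broker URL
QUEUE_NAME = "celery"                    # default Celery queue name

queue_length = Gauge(
    "celery_queue_length",
    "Number of messages waiting in the Celery queue.",
    ["queue"],
)

def main():
    client = redis.Redis.from_url(BROKER_URL)
    start_http_server(8000)  # scrape target for the metrics pipeline
    while True:
        # With the Redis broker, pending messages sit in a list keyed by the
        # queue name, so LLEN gives the current backlog size.
        queue_length.labels(queue=QUEUE_NAME).set(client.llen(QUEUE_NAME))
        time.sleep(15)

if __name__ == "__main__":
    main()
```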

danihodovic (Owner) commented:

#72 may be related
