Issue with Counter Metrics for Short-Lived Celery Workers in Kubernetes #336
Comments
Thanks for the kind words. Would you like to open a pull request? What is your use case for consistently scaling the number of workers up and down? Why not run a fixed set of workers?
Sure, I'll aim to open a PR next week when I have more bandwidth! The primary reason for implementing an auto-scaler for workers is that our workload peaks at specific times throughout the day. Instead of running a large number of workers continuously to handle those peak periods, we scale up only during those times, which makes much better use of resources.
How many workers are you running at min/max? To avoid generating new hostnames, you can run Celery workers as StatefulSets rather than Deployments, since StatefulSet pods keep stable names. That's what I do.
At the moment, I have a minimum of 4 and a maximum of 16. While a StatefulSet seems appropriate, I believe that scaling up and down might still lead to issues with workers that aren't running continuously. I think the proposed solution could also benefit use cases with static hostnames, since it addresses the problem of the first task received by a worker not being counted correctly.
👍 Do you use some kind of custom queue-based scaling in K8s? I'm interested in a solution myself for the project I'm working on.
For now, I'm simply scaling based on the CPU usage of the pods. I'm also exploring custom solutions that scale on queue size. I'll probably deploy a simple Python script that reads the queue and publishes its length as a metric for Kubernetes autoscaling, along the lines of the sketch below.
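Roughly something like this, a minimal sketch assuming a Redis broker and the default `celery` queue (the broker URL, queue name, and port are placeholders, not anything from this exporter). The resulting gauge could then drive an HPA through a Prometheus adapter or a tool like KEDA:

```python
import time

import redis
from prometheus_client import Gauge, start_http_server

# Assumed values for this sketch; adjust to your broker and queue.
REDIS_URL = "redis://redis:6379/0"
QUEUE_NAME = "celery"

queue_length = Gauge(
    "celery_queue_length",
    "Number of pending messages in a Celery queue",
    ["queue"],
)


def main() -> None:
    client = redis.Redis.from_url(REDIS_URL)
    start_http_server(9808)  # expose /metrics on an arbitrary port
    while True:
        # With a Redis broker, each Celery queue is a Redis list, so LLEN
        # gives the number of messages waiting to be consumed.
        queue_length.labels(queue=QUEUE_NAME).set(client.llen(QUEUE_NAME))
        time.sleep(15)


if __name__ == "__main__":
    main()
```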
#72 may be related |
Hi there,
First, I want to express my gratitude for developing this amazing exporter. It has been incredibly helpful in monitoring our systems.
Issue Description:
I am encountering an issue with incorrect numbers appearing in my Grafana dashboards. After investigating, I found that the root cause is how Prometheus's increase() and rate() functions handle new counter metrics: they do not account for counters that transition from a non-existent state to a value of 1. For example, if a worker's counter first appears in a scrape already at 1, increase() over that window reports 0 because there is no earlier sample to compare against, so the worker's first task is effectively lost. (For more details, see prometheus/prometheus#1673.)
Context:
Our Celery workers run in Kubernetes and are scaled up and down over the course of the day, so workers are short-lived: new pods come up with fresh hostnames and their counter series start from scratch.
Attempts to Resolve:
I have tried various workarounds by modifying the PromQL expressions, but none have successfully addressed the issue.
Proposed Solution:
To improve accuracy, it might be beneficial for the exporter to publish an initial value of 0 for all counter metrics when it detects a new worker. Prometheus would then scrape a starting sample at 0, so increase() and rate() capture the first increment instead of seeing a series that begins at 1.
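For illustration, here is a minimal sketch of the idea using `prometheus_client`; the metric name, labels, and event handler are made up for this example, not the exporter's real ones:

```python
from prometheus_client import Counter

# Hypothetical metric/label names, for illustration only.
celery_task_events_total = Counter(
    "celery_task_events_total",
    "Task events observed per worker",
    ["hostname", "state"],
)

TASK_STATES = ["received", "started", "succeeded", "failed", "retried"]


def on_worker_online(event: dict) -> None:
    """Pre-register counter series at 0 when a new worker appears.

    Calling .labels(...) creates the child series with an initial value of 0,
    so Prometheus can scrape a 0 sample before the first task event and
    increase()/rate() see the 0 -> 1 transition instead of a series that
    starts at 1.
    """
    hostname = event["hostname"]
    for state in TASK_STATES:
        celery_task_events_total.labels(hostname=hostname, state=state)
```

Note that this only helps when Prometheus scrapes the 0 sample before the first increment, which is usually the case when a worker comes online a little before it starts receiving tasks.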
Thank you for considering this enhancement. I believe it would greatly improve the accuracy of monitoring in dynamic environments like ours.
Best,
Lance