Possibility of clearing metrics every X seconds (memory problem) #280

Open
gciria opened this issue Oct 31, 2023 · 3 comments

Comments

@gciria
Contributor

gciria commented Oct 31, 2023

I am using version v0.9.2 with the variables CE_WORKER_TIMEOUT and CE_PURGE_OFFLINE_WORKER_METRICS changed to 20 seconds.

In my setup, every X minutes several nodes are started in batches in Kubernetes, with dozens of Celery pods consuming X queues.
Prometheus scrapes the metrics from the celery-exporter (9808/metrics) and stores them.
The purge variables do not seem to work well in my setup: in the logs I only see 1 or 2 pods purged after many hours.

I would like to know whether a new parameter could be added to purge everything under /metrics every X seconds, or whether you have any tips for another solution.

Thanks, and congrats on the great project.

@danihodovic
Owner

@adinhodovic

@adinhodovic
Collaborator

If your workers go offline (rotate), their metrics should be cleaned up quickly. This works fine for us with up to ~100 pods. On new releases all metrics get cleaned up quite quickly. We purge every 5 minutes, and a worker times out after 2.5 minutes. Are you not seeing the purge message often enough?

Maybe CE_GENERIC_HOSTNAME_TASK_SENT_METRIC=true will help with cardinality as well?

We don't have an option to clean all metrics at the moment.
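
To make the timeout-then-purge behaviour described above concrete, here is a minimal sketch in plain Python. It is not the exporter's actual code: the class `WorkerMetricStore`, its methods, and the `worker_up` / `tasks_total` series names are hypothetical, and the 150 s and 300 s thresholds are only one reading of the "2.5 minutes" and "5 minutes" mentioned in this comment.

```python
from typing import Dict, List, Tuple


class WorkerMetricStore:
    """Toy stand-in for per-worker metrics (hypothetical, not the exporter's class)."""

    def __init__(self, worker_timeout_seconds: float, purge_after_seconds: float) -> None:
        self.worker_timeout_seconds = worker_timeout_seconds
        self.purge_after_seconds = purge_after_seconds
        # hostname -> timestamp of the last heartbeat received from that worker
        self.worker_last_seen: Dict[str, float] = {}
        # (metric name, hostname) -> value; stands in for labelled time series
        self.series: Dict[Tuple[str, str], float] = {}

    def heartbeat(self, hostname: str, now: float) -> None:
        """Record that a worker is alive at time `now`."""
        self.worker_last_seen[hostname] = now
        self.series[("worker_up", hostname)] = 1

    def record(self, metric: str, hostname: str, value: float) -> None:
        """Store a value for one worker-labelled series."""
        self.series[(metric, hostname)] = value

    def sweep(self, now: float) -> List[str]:
        """Mark silent workers offline, then drop series of long-silent workers."""
        purged: List[str] = []
        # Only workers present in worker_last_seen are ever considered here.
        for hostname, last_seen in list(self.worker_last_seen.items()):
            idle = now - last_seen
            if idle > self.worker_timeout_seconds:
                self.series[("worker_up", hostname)] = 0
            if idle > self.purge_after_seconds:
                for key in [k for k in self.series if k[1] == hostname]:
                    del self.series[key]
                del self.worker_last_seen[hostname]
                purged.append(hostname)
        return purged


if __name__ == "__main__":
    store = WorkerMetricStore(worker_timeout_seconds=150, purge_after_seconds=300)
    store.heartbeat("pod-a", now=0.0)
    store.record("tasks_total", "pod-a", 7)
    print(store.sweep(now=100.0))  # within the timeout: nothing is purged
    print(store.sweep(now=200.0))  # timed out: marked offline, series kept
    print(store.sweep(now=400.0))  # past the purge threshold: pod-a is dropped
```

In this sketch a worker whose hostname never appears in `worker_last_seen` is never swept, so any series labelled with it would stay in memory indefinitely.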

@DvdChe

DvdChe commented Jan 23, 2024

Hey,

I have the same problem on my side.

I tried setting CE_GENERIC_HOSTNAME_TASK_SENT_METRIC=true, and some metrics now have their hostname set to a generic value, but others are still labelled with the pod name. I also tried customizing CE_PURGE_OFFLINE_WORKER_METRICS and CE_WORKER_TIMEOUT, but nothing gets purged.

I tried to find out how the garbage collection works, and I think I have partially found the cause:

On my side, the problem is that self.worker_last_seen remains empty and never gets updated, so metrics are never purged.
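
One way to check whether worker heartbeats arrive at all is to listen for them directly with Celery's event receiver. The sketch below assumes that the exporter fills `worker_last_seen` from worker heartbeat events, which is an assumption and not something confirmed in this thread. The helper names `make_heartbeat_handler` and `watch_heartbeats` are hypothetical, and the broker URL must be replaced with your own.

```python
import time
from typing import Callable, Dict


def make_heartbeat_handler(worker_last_seen: Dict[str, float]) -> Callable[[dict], None]:
    """Return an event handler that records when each worker was last heard from."""

    def on_worker_event(event: dict) -> None:
        hostname = event["hostname"]
        # Fall back to the local clock if the event carries no timestamp.
        worker_last_seen[hostname] = event.get("timestamp", time.time())
        print(f"{event.get('type', 'event')} from {hostname}")

    return on_worker_event


def watch_heartbeats(broker_url: str) -> None:
    """Block and print every worker-online / worker-heartbeat event seen on the broker."""
    # Imported here so the handler above can be exercised without Celery installed.
    from celery import Celery

    app = Celery(broker=broker_url)
    seen: Dict[str, float] = {}
    handler = make_heartbeat_handler(seen)
    with app.connection() as connection:
        receiver = app.events.Receiver(
            connection,
            handlers={"worker-heartbeat": handler, "worker-online": handler},
        )
        receiver.capture(limit=None, timeout=None, wakeup=True)


# Example (hypothetical broker URL, runs until interrupted):
# watch_heartbeats("redis://localhost:6379/0")
```

If nothing is printed while workers are running, heartbeat events are probably not reaching event consumers on that broker, which would be consistent with `worker_last_seen` staying empty.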

> If your workers go offline (rotate), their metrics should be cleaned up quickly. This works fine for us with up to ~100 pods. On new releases all metrics get cleaned up quite quickly. We purge every 5 minutes, and a worker times out after 2.5 minutes. Are you not seeing the purge message often enough?
>
> Maybe CE_GENERIC_HOSTNAME_TASK_SENT_METRIC=true will help with cardinality as well?
>
> We don't have an option to clean all metrics at the moment.

What do you mean by "go offline"? Is it a graceful disconnection made by the workers, or something like that? (Sorry for the question, but I know absolutely nothing about Celery.)
