resource requests/limits updates, mainly to ingress-nginx and prometheus #2324
Conversation
Overall this looks good - correctly provisioning our needed support infrastructure is definitely a good thing, since under-provisioning can cause spurious issues like the one that prompted your investigation.
LGTM
Would this trigger core pool scale-up in some places? There might be a slight risk of a flurry of pages if prometheus reschedules and then takes a while (11 mins) to come up because it has to move to a newly created node. Probably worth keeping an eye on pagerduty-notifications and posting to maintenance-notices before deploying.
Thank you for the review @pnasrat!!!
I think possibly! But even if we don't wait for a scale-up, the prometheus-server pods will typically restart, which can trigger pagerduty notifications. When they restart they can take several minutes, maybe about 1-10, to read through their write-ahead log (WAL), and I think during that time they may be unresponsive as well! Also, since we now change the resource requests on the ingress-nginx pod, we will disrupt networking briefly when it restarts. For concreteness, the kind of values change involved looks roughly like the sketch below.
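A minimal sketch of setting the prometheus-server memory request equal to its limit in the support chart's Helm values - the key names and amounts here are assumptions for illustration, not the actual diff:

```yaml
# Illustrative only: the real chart keys and sizes may differ from the PR.
prometheus:
  server:
    resources:
      requests:
        memory: 4Gi   # request equal to the limit, so the scheduler reserves
      limits:         #   the full amount up front and the pod can never use
        memory: 4Gi   #   more memory than it was scheduled with
```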
Nod - my change in #2321 will increase the page window for prometheus being down to 31 min, which will stop noisy pages for restarts. The nginx reschedule worries me more - how might you release this in a way that doesn't impact hub users?
Now that I think of it, this change may not be as disruptive as I initially thought:
I'm not sure about the consequences, but I think we should accept the risk. I could delay merging this and do the work on Sunday evening / Monday morning Swedish time, as I've planned to perform maintenance on the LEAP cluster then and there may be less activity overall.
Note I don't think this merits holding off on our deployment - I'm merely thinking through the impact and understanding it. I think you are right on the deployment upgrade strategy https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#updating-a-deployment (see the sketch below). LGTM - go ahead and merge and monitor the deployment.
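For reference, the rolling-update behaviour linked above can be sketched roughly like this - a hypothetical, simplified Deployment, not taken from the actual ingress-nginx chart:

```yaml
# Illustrative only: names, image tag, and resource amounts are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingress-nginx-controller
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ingress-nginx
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # bring the replacement pod up alongside the old one
      maxUnavailable: 0   # keep the old pod serving until the new one is Ready
  template:
    metadata:
      labels:
        app: ingress-nginx
    spec:
      containers:
        - name: controller
          image: registry.k8s.io/ingress-nginx/controller:v1.6.4
          resources:
            requests:
              memory: 250Mi   # the updated request that triggers the rollout
            limits:
              memory: 500Mi
```

With `maxUnavailable: 0`, the old controller pod keeps serving traffic until its replacement passes its readiness checks, so the networking disruption is limited to the brief hand-over rather than a gap with no controller running.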
🎉🎉🎉🎉 Monitor the deployment of the hubs here 👉 https://github.com/2i2c-org/infrastructure/actions/runs/4374529751
Thank you for your thorough review of this @pnasrat!!
General motivation
By setting memory requests to match limits for prometheus-server, I think we shouldn't get memory pressure on its node caused by prometheus-server any more. That way, we won't get unrelated pods evicted from the node, which is what happened to a critical ingress-nginx-controller pod in #2322.
If we still experience memory pressure on a node for some reason, we now have memory requests for ingress-nginx that may well be higher than what it actually uses, which makes it one of the last candidates for eviction. A sketch of this part of the change is below.
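A minimal sketch of the ingress-nginx side of this, again with hypothetical key names and amounts rather than the actual diff:

```yaml
# Illustrative only: the real chart keys and sizes may differ from the PR.
ingress-nginx:
  controller:
    resources:
      requests:
        memory: 250Mi   # set above typical usage: pods using less memory than
                        #   they requested are among the last to be evicted
                        #   under node memory pressure
```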
Details about changes
Related