Ruler reports an error EOF when sending alert to alertmanager #4958
It was recommended in the Cortex Slack channel that I file a bug, so here we go. Is there a known way to mitigate this issue? I have now set up alertmanager in cluster mode, but I am still looking for a way to verify that all alerts have been posted correctly (despite the EOF).
Thanks for filing this issue. Do you have any gateway in front of alertmanager where you can tune the idle connection timeout to be longer than 5 minutes, as mentioned in this comment: prometheus/prometheus#9057 (comment)
Never mind, @AlexandreRoux, I saw your comment in the Prometheus issue, and it sounds like modifying the alertmanager server-side connection idle timeout is not feasible for you?
We've been talking in Slack. I think I have had this problem for a while, but I was ignoring it because the alerts are eventually sent. Yesterday I discovered that the issue goes away when I enable alertmanager sharding.
My current solution:
Removed this:
Added this:
Note: Unfortunately this is not possible for @AlexandreRoux, because he can't enable alertmanager sharding, he is using
I rolled back to no sharding and now use alertmanager gossip. I discovered that my pod wasn't exposing TCP port 9094 correctly. I solved the problem by deleting the alertmanager statefulset and recreating it.
@alvinlin123 - I apologize for the delay in replying here. Indeed, as @friedrichg mentioned, it seems that port 9094 is not exposed correctly on my side either.
In https://github.com/cortexproject/cortex-helm-chart it seems we are missing a reference to port 9094.
I will bring this forward to https://github.com/cortexproject/cortex-helm-chart and help improve the charts.
@nschad - FYI, I may open a bug and start working on getting 9094 added in https://github.com/cortexproject/cortex-helm-chart.
@alvinlin123 / @friedrichg - I found some time again to troubleshoot my EOF between the ruler and alertmanager, and unfortunately the issue is still present for me after fixing the exposure of port 9094 (TCP + UDP). Here are all the details if you are interested:
Describe the bug
Cortex ruler logs show an EOF error when posting alerts to the Cortex alertmanager.
notifier.go is Prometheus code and could be missing a
req.Close = true
as pointed out here. A bug report exists in the Prometheus repo:
The number of file descriptors used by the alertmanager process is below the default alertmanager file descriptor limit in my case.
This bug does not seem to be tied to a specific alert and happens randomly.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Alertmanager should receive all POSTed alerts correctly.
Environment: