
Ruler reports an EOF error when sending alerts to alertmanager #4958

Open
humblebundledore opened this issue Nov 8, 2022 · 8 comments

humblebundledore commented Nov 8, 2022

Describe the bug
Cortex ruler logs show an EOF error when posting alerts to the Cortex alertmanager.

level=error caller=notifier.go:527 user=tenant-one alertmanager=http://cortex-alertmanager.cortex.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts count=1 msg="Error sending alert" err="Post \"http://cortex-alertmanager.cortex.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts\": EOF"

notifier.go is Prometheus code and could be missing a req.Close = true, as pointed out here.

A bug report exists in the Prometheus repo:

The number of file descriptors used by the alertmanager process is below the default alertmanager file-descriptor limit in my case.
This bug does not seem to be tied to a specific alert and happens randomly.

To Reproduce
Steps to reproduce the behavior:

  1. Set up the ruler / alertmanager
  2. Send alerts from the ruler to the alertmanager until an EOF occurs

Expected behavior
Alertmanager should receive all POSTed alerts correctly.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: Cortex Helm chart (1.6.0), Cortex (v1.13.0)
@humblebundledore
Author

I was advised on the Cortex Slack channel to file a bug, so here we go.

Is there a known way to mitigate this issue? I have set up alertmanager in cluster mode now, but I am still looking for a way to verify that all alerts have been posted correctly (despite the EOF).

@alvinlin123
Contributor

Thanks for filing this issue. Do you have any gateway in front of alertmanager where you can tune the idle connection timeout to be longer than 5 minutes, as mentioned in this comment: prometheus/prometheus#9057 (comment)?

@alvinlin123
Contributor

Never mind @AlexandreRoux, I saw your comment in the Prometheus issue, and it sounds like modifying the alertmanager server-side connection idle timeout is not feasible for you?

@friedrichg
Member

We've been talking in Slack. I think I have had this problem for a while, but was ignoring it because the alerts are eventually sent. I discovered yesterday that the issue goes away when I enable alertmanager sharding.

My current solution:

Removed this:

        --alertmanager.cluster.listen-address=[$(POD_IP)]:9094
        --alertmanager.cluster.peers=alertmanager-0.alertmanager.namespace.svc.cluster.local:9094,alertmanager-1.alertmanager.namespace.svc.cluster.local:9094,alertmanager-2.alertmanager.namespace.svc.cluster.local:9094

Added this:

        -alertmanager.sharding-enabled=true
        -alertmanager.sharding-ring.replication-factor=3
        -alertmanager.sharding-ring.store=memberlist
        -memberlist.abort-if-join-fails=false
        -memberlist.bind-port=7946
        -memberlist.join=gossip-ring.namespace.svc.cluster.local:7946
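
For reference, here is a minimal sketch of the same settings expressed as Cortex YAML configuration; the key names are inferred from the flag names and may differ between Cortex versions, so treat them as assumptions:

        # Assumed YAML equivalent of the sharding flags above; verify key names against your Cortex version.
        alertmanager:
          sharding_enabled: true
          sharding_ring:
            replication_factor: 3
            kvstore:
              store: memberlist
        memberlist:
          abort_if_cluster_join_fails: false
          bind_port: 7946
          join_members:
            - gossip-ring.namespace.svc.cluster.local:7946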

Note: unfortunately this is not possible for @AlexandreRoux, because he can't enable alertmanager sharding; he is using the local backend for alertmanager:

-alertmanager-storage.backend=local

@friedrichg
Member

I rolled back to no sharding and to using alertmanager gossip.

I discovered my pod wasn't exposing the 9094 TCP port correctly.
There is a long-standing open Kubernetes bug that occurs when a pod uses the same port over both UDP and TCP:
kubernetes/kubernetes#39188

I solved the problem by deleting the alertmanager StatefulSet and recreating it.
Let me know if this helps, @AlexandreRoux.
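
For context, the condition that trips that Kubernetes bug is a pod declaring the same port number on both TCP and UDP, which is what the alertmanager gossip setup needs on 9094. A minimal sketch of such a containerPorts fragment (illustrative, not copied from the actual manifests):

        # Illustrative pod spec fragment: the alertmanager gossip port declared for
        # both protocols. This same-port TCP+UDP pairing is the pattern that can
        # hit kubernetes/kubernetes#39188.
        ports:
          - name: gossip-tcp
            containerPort: 9094
            protocol: TCP
          - name: gossip-udp
            containerPort: 9094
            protocol: UDP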

@humblebundledore
Author

@alvinlin123 - I apologize for the delay in replying here.

Indeed, as @friedrichg mentioned, it seems that 9094 is not exposed correctly on my side either.
In addition, I also noticed that there might be some missing configuration in the way the Helm chart deploys Cortex: https://github.com/cortexproject/cortex-helm-chart

$ k get services -n cortex-base | grep alertmanager
cortex-base-alertmanager              ClusterIP   10.xx.xx.201   <none>        8080/TCP   77d
cortex-base-alertmanager-headless     ClusterIP   None            <none>        8080/TCP   9d

$ k describe pods/cortex-base-alertmanager-0 -n cortex-base
    Ports:         8080/TCP, 7946/TCP

$ k describe statefulset/cortex-base-alertmanager -n cortex-base
    Ports:       8080/TCP, 7946/TCP
    Host Ports:  0/TCP, 0/TCP

$ kubectl exec -ti cortex-base-alertmanager-0 -c alertmanager -n cortex-base -- /bin/sh
/ # nc -zv 127.0.0.1:9094
127.0.0.1:9094 (127.0.0.1:9094) open
/ # nc -zv cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094
cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094 (10.xx.xx.119:9094) open

$ k logs -f -n cortex-base -l app.kubernetes.io/component=alertmanager -c alertmanager
level=debug ts=2022-12-13T07:26:33.456676073Z caller=cluster.go:337 component=cluster memberlist="2022/12/13 07:26:33 [DEBUG] memberlist: Initiating push/pull sync with: 01GKWRxxxxxxxxxQSDT73 10.xx.xx.223:9094\n"

In https://github.com/cortexproject/cortex-helm-chart it seems we are missing a reference to port 9094.
In https://github.com/cortexproject/cortex-jsonnet I am able to generate the appropriate YAML, for example:

cortex-jsonnet/manifests ∙ grep -r "9094" ./ 
.//apps-v1.StatefulSet-alertmanager.yaml:        - --alertmanager.cluster.listen-address=[$(POD_IP)]:9094
.//apps-v1.StatefulSet-alertmanager.yaml:        - --alertmanager.cluster.peers=alertmanager-0.alertmanager.default.svc.cluster.local:9094,alertmanager-1.alertmanager.default.svc.cluster.local:9094,alertmanager-2.alertmanager.default.svc.cluster.local:9094
.//apps-v1.StatefulSet-alertmanager.yaml:        - containerPort: 9094
.//apps-v1.StatefulSet-alertmanager.yaml:        - containerPort: 9094
.//v1.Service-alertmanager.yaml:    port: 9094
.//v1.Service-alertmanager.yaml:    targetPort: 9094
.//v1.Service-alertmanager.yaml:    port: 9094
.//v1.Service-alertmanager.yaml:    targetPort: 9094
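
Based on that output, a minimal sketch of the Service-side ports the Helm chart appears to be missing (port numbers taken from the grep above; the port names are assumptions):

        # Sketch of the alertmanager Service ports mirroring v1.Service-alertmanager.yaml
        # from cortex-jsonnet; the port names are assumptions, the numbers come from the grep output.
        ports:
          - name: gossip-tcp
            port: 9094
            targetPort: 9094
            protocol: TCP
          - name: gossip-udp
            port: 9094
            targetPort: 9094
            protocol: UDP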

I will bring this forward to https://github.com/cortexproject/cortex-helm-chart and help improve the charts.
I think we are good to close here :)

@humblebundledore
Author

@nschad - as an FYI, I will maybe open a bug and start working on getting 9094 added in https://github.com/cortexproject/cortex-helm-chart

@humblebundledore
Author

@alvinlin123 / @friedrichg - I was able to find some time again to troubleshoot my EOF between the ruler and the alertmanager, and unfortunately the issue is still present after fixing the port 9094 (TCP + UDP) exposure.

Here are all the details if you are interested:
cortexproject/cortex-helm-chart#420 (comment)
cortexproject/cortex-helm-chart#420 (comment)
