
etcd - All grpc_code for grpc_method "Watch" is "Unavailable" #20311

Open
Reamer opened this issue Jul 13, 2018 · 22 comments
Labels: lifecycle/frozen, sig/master

@Reamer

Reamer commented Jul 13, 2018

Hi,
I noticed that every grpc_code for grpc_method "Watch" is "Unavailable" in my OKD cluster. My plan is to monitor the etcd instances with the default Prometheus alerts from the etcd project.
Maybe the watch connection is not closed correctly and runs into a timeout.

Version
Client Version: 4.7.18
Server Version: 4.7.0-0.okd-2021-08-22-163618
Kubernetes Version: v1.20.0-1093+4593a24e8fd58d-dirty
Steps To Reproduce
  1. Install OKD 4.7
  2. Switch to the etcd project: oc project openshift-etcd
  3. Log in to the first etcd member: oc rsh etcd-master1.mycompany.com
  4. Fetch the metrics: curl -s --cacert "/etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt" --cert "/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-master1.mycompany.com.crt" --key "/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-master1.mycompany.com.key" https://localhost:2379/metrics
Current Result
grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} 1434
Expected Result
grpc_server_handled_total{grpc_code="OK",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} 1434

Additional Information

If that behavior is already fixed or it's a false positive, let me know.

@jwforres
Member

@openshift/sig-master

@Reamer
Author

Reamer commented Aug 9, 2018

Still present with OpenShift 3.10:

oc v3.10.0+0c4577e-1
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://s-cp-lb-01.cloud.example.de:443
openshift v3.10.0+7eee6f8-2
kubernetes v1.10.0+b81c8f8

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 7, 2018
@vsliouniaev

+1 on this. We've disabled this alert on our setup because it's just flapping and not indicating any failures.

@Reamer
Author

Reamer commented Nov 7, 2018

/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 7, 2018
@gaopeiliang

+1 on this. I also see it on the etcd cluster master nodes after adding etcd3_alert.rules.

[screenshot]

The alert cycles roughly every five minutes, but we can't find anything wrong with Kubernetes.
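
For context, the etcd example alert rules fire when the share of non-OK responses per gRPC method crosses a threshold, so a steady trickle of "Unavailable" Watch completions is enough to make the alert flap. Paraphrased (not the exact rule text; label matchers and threshold vary by version), the expression is roughly of this form:

100 * sum(rate(grpc_server_handled_total{grpc_code!="OK", job=~".*etcd.*"}[5m])) BY (grpc_service, grpc_method)
  / sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) BY (grpc_service, grpc_method)
  > 1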

@gaopeiliang

/remove-lifecycle stale

@arslanbekov

arslanbekov commented Nov 29, 2018

+1.
I ran etcd with the debug log level and found this error:

etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 71; CANCEL")

The error shows up roughly once every 5 minutes, each time with a unique stream ID.

etcd 3.2.24 / 3.2.25 / 3.3.10
Monitoring with Prometheus (I am getting this alert).

Any updates?

@judexzhu

+1, etcd 3.3.10 with Prometheus Operator on Kubernetes 1.11.5.

I have 5 nodes, but only one node is raising the alert; the others seem fine.

The etcd cluster runs well without issues.

[screenshot]

@zqyangchn

[screenshot]

@zqyangchn

/remove-lifecycle stale

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 17, 2019
@Reamer
Author

Reamer commented May 20, 2019

Still reproducible on Origin 3.11

@Reamer
Author

Reamer commented May 20, 2019

/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 20, 2019
@Reamer
Author

Reamer commented Jun 21, 2019

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 19, 2019
@Reamer
Author

Reamer commented Sep 19, 2019

/remove-lifecycle stale
Still present

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 19, 2019
@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 18, 2019
@Reamer
Author

Reamer commented Dec 19, 2019

/lifecycle frozen
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 19, 2019
@openshift-ci-robot openshift-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Dec 19, 2019
@hexfusion
Contributor

/assign

@Joseph94m

Any news about this?

@Reamer
Author

Reamer commented Oct 7, 2021

At the moment I am using OKD 4.7 and this bug is still present.
Prometheus query:

grpc_server_handled_total{grpc_code="Unavailable",grpc_service="etcdserverpb.Watch"}
