
STREAM: Failed to start: failed to join Raft group after updating kubernetes version #44

Open
tahsinrahman opened this issue Jun 27, 2019 · 6 comments

Comments


tahsinrahman commented Jun 27, 2019

We're running the NATS Streaming operator on GKE and it was working fine, but after we upgraded the Kubernetes version, NATS clients are no longer able to connect to the NATS server.

$ kubectl get natsclusters.nats.io nats-cluster -o yaml
apiVersion: nats.io/v1alpha2
kind: NatsCluster
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"nats.io/v1alpha2","kind":"NatsCluster","metadata":{"annotations":{},"name":"nats-cluster","namespace":"default"},"spec":{"size":3}}
  creationTimestamp: "2019-05-09T05:56:01Z"
  generation: 4
  name: nats-cluster
  namespace: default
  resourceVersion: "34972339"
  selfLink: /apis/nats.io/v1alpha2/namespaces/default/natsclusters/nats-cluster
  uid: 1f98955b-721f-11e9-b8dc-42010a8001a9
spec:
  size: 3
status:
  conditions:
  - reason: scaling cluster from 2 to 3 peers
    transitionTime: "2019-06-26T11:04:40Z"
    type: ScalingUp
  - reason: current state matches desired state
    transitionTime: "2019-06-26T11:05:53Z"
    type: Ready
  - reason: scaling cluster from 2 to 3 peers
    transitionTime: "2019-06-26T11:05:54Z"
    type: ScalingUp
  - reason: current state matches desired state
    transitionTime: "2019-06-26T11:06:06Z"
    type: Ready
  - reason: scaling cluster from 2 to 3 peers
    transitionTime: "2019-06-26T19:47:43Z"
    type: ScalingUp
  - reason: current state matches desired state
    transitionTime: "2019-06-26T19:48:12Z"
    type: Ready
  - reason: scaling cluster from 1 to 3 peers
    transitionTime: "2019-06-26T19:51:14Z"
    type: ScalingUp
  - reason: current state matches desired state
    transitionTime: "2019-06-26T19:52:07Z"
    type: Ready
  - reason: scaling cluster from 2 to 3 peers
    transitionTime: "2019-06-26T19:57:13Z"
    type: ScalingUp
  - reason: current state matches desired state
    transitionTime: "2019-06-26T19:57:38Z"
    type: Ready
  currentVersion: 1.4.0
  size: 3
$ kubectl get natsstreamingclusters.streaming.nats.io pharmer-cluster -o yaml
apiVersion: streaming.nats.io/v1alpha1
kind: NatsStreamingCluster
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"streaming.nats.io/v1alpha1","kind":"NatsStreamingCluster","metadata":{"annotations":{},"name":"pharmer-cluster","namespace":"default"},"spec":{"natsSvc":"nats-cluster","size":3}}
  creationTimestamp: "2019-05-09T05:58:23Z"
  generation: 1
  name: pharmer-cluster
  namespace: default
  resourceVersion: "19344769"
  selfLink: /apis/streaming.nats.io/v1alpha1/namespaces/default/natsstreamingclusters/pharmer-cluster
  uid: 747d5286-721f-11e9-b8dc-42010a8001a9
spec:
  natsSvc: nats-cluster
  size: 3
$ kubectl logs -f pharmer-cluster-1
[1] 2019/06/27 04:43:47.580129 [INF] STREAM: Starting nats-streaming-server[pharmer-cluster] version 0.11.2
[1] 2019/06/27 04:43:47.580191 [INF] STREAM: ServerID: ul5c5zg9XBBSq3PMBuRn4j
[1] 2019/06/27 04:43:47.580196 [INF] STREAM: Go version: go1.11.1
[1] 2019/06/27 04:43:47.604905 [INF] STREAM: Recovering the state...
[1] 2019/06/27 04:43:47.605177 [INF] STREAM: No recovered state
[1] 2019/06/27 04:43:47.605272 [INF] STREAM: Cluster Node ID : "pharmer-cluster-1"
[1] 2019/06/27 04:43:47.605287 [INF] STREAM: Cluster Log Path: pharmer-cluster/"pharmer-cluster-1"
[1] 2019/06/27 04:43:52.714067 [INF] STREAM: Shutting down.
[1] 2019/06/27 04:43:52.714455 [FTL] STREAM: Failed to start: failed to join Raft group pharmer-cluster

From nats client

$ kubectl logs -f pharmer-clsuter-868bd8d465-xwv77
stan: connect request timeout
$ kubectl logs -f nats-cluster-1                                                                                                                                                                             
[1] 2019/06/26 19:51:55.159570 [INF] Starting nats-server version 1.4.0                                                                                                                                             
[1] 2019/06/26 19:51:55.159628 [INF] Git commit [ce2df36]                                                                                                                                                           
[1] 2019/06/26 19:51:55.159856 [INF] Starting http monitor on 0.0.0.0:8222                                                                                                                                          
[1] 2019/06/26 19:51:55.159907 [INF] Listening for client connections on 0.0.0.0:4222                                                                                                                               
[1] 2019/06/26 19:51:55.159915 [INF] Server is ready                                                                                                                                                                
[1] 2019/06/26 19:51:55.160129 [INF] Listening for route connections on 0.0.0.0:6222                                                                                                                                
[1] 2019/06/26 19:51:55.169644 [INF] 10.60.3.20:6222 - rid:1 - Route connection created                                                                              
[1] 2019/06/26 19:51:55.173249 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-1.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:51:55.173271 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:51:56.183583 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:51:56.183612 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-1.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:51:56.565355 [INF] 10.60.3.20:52526 - rid:2 - Route connection created
[1] 2019/06/26 19:51:57.190034 [INF] 10.60.1.16:53246 - rid:4 - Route connection created
[1] 2019/06/26 19:51:57.190107 [INF] 10.60.1.16:6222 - rid:3 - Route connection created
[1] 2019/06/26 19:51:57.195326 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:51:58.207202 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:51:59.227256 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:52:00.240231 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:52:01.251089 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:52:02.262939 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:52:03.273854 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:52:03.409542 [INF] 10.60.1.19:51194 - rid:5 - Route connection created
[1] 2019/06/26 19:52:04.291668 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:52:05.305117 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:52:06.348944 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:52:07.360903 [INF] 10.60.1.19:6222 - rid:10 - Route connection created
[1] 2019/06/26 19:55:57.369605 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:55:59.370067 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:01.370584 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:03.371988 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:05.372566 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:07.373118 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:09.373677 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:11.374126 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:13.374572 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:15.375083 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:17.375585 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:19.376117 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:21.376767 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:23.377261 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:25.377780 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:26.384664 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:56:27.390153 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:56:28.397100 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:56:29.402200 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:56:30.407839 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:56:31.417194 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:56:32.423944 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:56:33.429647 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:56:34.436051 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:56:35.441762 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:56:36.452196 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:56:37.458081 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
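The repeated `no such host` errors suggest the per-pod DNS records behind the headless service stopped resolving after the upgrade. A quick way to check from inside the cluster (hostnames taken from the logs above; the `busybox` image and the `app=nats` label are assumptions, adjust to your setup):

```shell
# Try resolving one of the per-pod records the routes are dialing.
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.31 -- \
  nslookup nats-cluster-1.nats-cluster-mgmt.default.svc

# Per-pod records only exist while the headless service has matching
# endpoints, so also confirm the endpoints are populated.
kubectl get endpoints nats-cluster-mgmt -o wide
kubectl get pods -l app=nats -o wide   # selector label is an assumption
```

If the endpoints list is empty while the pods are running, the problem is likely the service selector or pod labels rather than cluster DNS itself.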
tahsinrahman commented Jun 27, 2019

If I scale the NATS Streaming cluster down to size 1, the pod pharmer-cluster-1 runs.

$ kubectl logs -f pharmer-cluster-1
[1] 2019/06/27 05:30:26.817748 [INF] STREAM: Starting nats-streaming-server[pharmer-cluster] version 0.12.2
[1] 2019/06/27 05:30:26.817803 [INF] STREAM: ServerID: pleessm07DDglxq2tYFk5b
[1] 2019/06/27 05:30:26.817808 [INF] STREAM: Go version: go1.11.6
[1] 2019/06/27 05:30:26.817811 [INF] STREAM: Git commit: [4489c46]
[1] 2019/06/27 05:30:26.835108 [INF] STREAM: Recovering the state...
[1] 2019/06/27 05:30:26.835336 [INF] STREAM: No recovered state
[1] 2019/06/27 05:30:27.086859 [INF] STREAM: Message store is FILE
[1] 2019/06/27 05:30:27.086882 [INF] STREAM: Store location: store
[1] 2019/06/27 05:30:27.086941 [INF] STREAM: ---------- Store Limits ----------
[1] 2019/06/27 05:30:27.086945 [INF] STREAM: Channels:                  100 *
[1] 2019/06/27 05:30:27.086948 [INF] STREAM: --------- Channels Limits --------
[1] 2019/06/27 05:30:27.086951 [INF] STREAM:   Subscriptions:          1000 *
[1] 2019/06/27 05:30:27.086954 [INF] STREAM:   Messages     :       1000000 *
[1] 2019/06/27 05:30:27.086956 [INF] STREAM:   Bytes        :     976.56 MB *
[1] 2019/06/27 05:30:27.086959 [INF] STREAM:   Age          :     unlimited *
[1] 2019/06/27 05:30:27.086962 [INF] STREAM:   Inactivity   :     unlimited *
[1] 2019/06/27 05:30:27.086964 [INF] STREAM: ----------------------------------
[1] 2019/06/27 05:31:04.613144 [INF] STREAM: Channel "create-cluster" has been created
[1] 2019/06/27 05:31:04.616178 [INF] STREAM: Channel "delete-cluster" has been created
[1] 2019/06/27 05:31:04.618912 [INF] STREAM: Channel "retry-cluster" has been created

But when I scale up to size 2 or 3, the first pod keeps running while the other pods fail:

$ kubectl logs -f pharmer-cluster-2
[1] 2019/06/27 05:39:05.591052 [INF] STREAM: Starting nats-streaming-server[pharmer-cluster] version 0.12.2
[1] 2019/06/27 05:39:05.591103 [INF] STREAM: ServerID: EZj3gVhakNMkk7k81uEP9K
[1] 2019/06/27 05:39:05.591107 [INF] STREAM: Go version: go1.11.6
[1] 2019/06/27 05:39:05.591110 [INF] STREAM: Git commit: [4489c46]
[1] 2019/06/27 05:39:05.610442 [INF] STREAM: Recovering the state...
[1] 2019/06/27 05:39:05.610731 [INF] STREAM: No recovered state
[1] 2019/06/27 05:39:05.610840 [INF] STREAM: Cluster Node ID : "pharmer-cluster-2"
[1] 2019/06/27 05:39:05.610846 [INF] STREAM: Cluster Log Path: pharmer-cluster/"pharmer-cluster-2"
[1] 2019/06/27 05:39:10.720020 [INF] STREAM: Shutting down.
[1] 2019/06/27 05:39:10.720534 [FTL] STREAM: Failed to start: failed to join Raft group pharmer-cluster
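The scale-down/scale-up steps above can be driven directly against the CRD with `kubectl patch` (a sketch; the resource and field names are taken from the `NatsStreamingCluster` spec shown earlier):

```shell
# Scale the streaming cluster down to a single node via spec.size.
kubectl patch natsstreamingclusters.streaming.nats.io pharmer-cluster \
  --type merge -p '{"spec":{"size":1}}'

# Once pharmer-cluster-1 is healthy again, scale back up.
kubectl patch natsstreamingclusters.streaming.nats.io pharmer-cluster \
  --type merge -p '{"spec":{"size":3}}'
```

Note this only reproduces the symptom described here; the new pods will still fail to join the Raft group until the underlying connectivity issue is resolved.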


VFe commented Aug 1, 2019

Having this exact same issue on EKS, except it happens without any upgrade. What Kubernetes version did you upgrade to? (I'm on 1.12 and experiencing this.)

@tahsinrahman

We upgraded to 1.13.6-gke.13.


VFe commented Aug 5, 2019

Been having this issue on 1.11, 1.12, and 1.13 in EKS. I've been digging but am still not sure what the root cause is.

@charles-d-burton

I'm having this issue on 1.14 with K3S as well.


orefalo commented May 6, 2020

Same with k3s v1.17.4+k3s1.
