
Watch API blocked forever even with WithRequireLeader option #12534

Closed
timestee opened this issue Dec 9, 2020 · 7 comments

Comments

@timestee

timestee commented Dec 9, 2020

Possibly a duplicate of #12022, but this happens even with the WithRequireLeader option.

When the connection is dropped due to a restart of the etcd server, reading from the watch channel sometimes blocks forever, even with the WithRequireLeader option. The only output is these error logs:

{"level":"warn","ts":"2020-12-09T20:41:09.730+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-df29e0f0-3bb7-4d66-a7ba-d2b2e4850b07/127.0.0.1:12379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:12379: connect: connection refused\""}
{"level":"warn","ts":"2020-12-09T20:46:14.739+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-df29e0f0-3bb7-4d66-a7ba-d2b2e4850b07/127.0.0.1:12379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:32379: connect: connection refused\""}

Code snippet:

// wc is a valid clientv3.Watcher; respCh, pairs, rev, v, directory, opts and stopCh come from the enclosing scope
go func() {
	defer func() { _ = wc.Close() }()
	defer close(respCh)

	// Push the current value through the channel.
	respCh <- pairs

	rev++
	ctx := context.Background()
	ctx = clientv3.WithRequireLeader(ctx)
	watchCh := wc.Watch(ctx, v.normalize(directory), clientv3.WithPrefix(), clientv3.WithRev(rev))
	for {
		// Check if the watch was stopped by the caller
		select {
		case <-stopCh:
			return
		case <-watchCh: // note: the WatchResponse itself is discarded, so stream errors are never checked
			_, pairs, err := v.list(directory, opts)
			if err != nil {
				return
			}
			respCh <- pairs
		}
	}
}()

Go client version: v3.3.0-rc.0.0.20200720071305-89da79188f73
Server version: etcd 3.4.13

@agargi
Contributor

agargi commented Dec 27, 2020

@timestee Whether a leader exists in the cluster is only checked at the time the Watch is created. The same is true for any gRPC call, since the check is done only in the interceptor and not afterwards. IIUC, what you are asking for could be a future enhancement (though I assume it would need to be done in a way that does not change existing behavior but instead adds a new capability to cancel the watch when there is no leader).

kubernetes/kubernetes#89488
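
For anyone hitting this today, a client-side workaround is possible: since leadership is only validated when the watch stream is established, the reader can pair WithRequireLeader with progress notifications and an idle timeout, and recreate the watch whenever the stream errors or goes quiet. A minimal sketch, assuming a v3.4-era import path; watchWithTimeout and idleTimeout are illustrative names, not part of the etcd client API:

package watcherutil

import (
	"context"
	"time"

	"go.etcd.io/etcd/clientv3" // assumption: v3.4-era import path
)

// watchWithTimeout recreates the watch whenever the stream errors, is canceled,
// or stays silent longer than idleTimeout. With WithProgressNotify the server
// sends periodic empty responses, so prolonged silence suggests a dead stream.
func watchWithTimeout(ctx context.Context, cli *clientv3.Client, prefix string, rev int64, idleTimeout time.Duration) {
	for {
		wctx, cancel := context.WithCancel(clientv3.WithRequireLeader(ctx))
		wch := cli.Watch(wctx, prefix,
			clientv3.WithPrefix(), clientv3.WithRev(rev), clientv3.WithProgressNotify())

	recv:
		for {
			select {
			case resp, ok := <-wch:
				if !ok || resp.Canceled || resp.Err() != nil {
					break recv // stream ended or errored; recreate the watch
				}
				for _, ev := range resp.Events {
					rev = ev.Kv.ModRevision + 1 // resume after the last seen event
					// handle the event here
				}
			case <-time.After(idleTimeout):
				break recv // no events or progress notifications; assume the stream is dead
			case <-ctx.Done():
				cancel()
				return
			}
		}
		cancel()
	}
}

The idle timeout has to be longer than the server's progress-notify interval, otherwise healthy but quiet watches would be torn down needlessly.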

@timestee
Author

@agargi You got my point: etcd should cancel the watch or return an error if the leader is lost. Otherwise, when the leader is lost, the service provider's keep-alive fails, and what should the service provider do then?

  • Should the service provider shut itself down? But the watcher cannot be notified that the leader was lost; it just blocks.
  • Should the service provider ignore the failed keep-alive? But when the etcd server comes back, the watcher will be notified that all registered services were lost because of the TTL. Should the service provider automatically register again now? That seems a bit messy to me (a re-registration sketch follows below).
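
A sketch of the second option, re-registering automatically: when the keep-alive channel closes, grant a new lease and put the registration again. This is a minimal sketch assuming a v3.4-era import path; register, key, val and ttl are illustrative names, not etcd APIs:

package registry

import (
	"context"
	"time"

	"go.etcd.io/etcd/clientv3" // assumption: v3.4-era import path
)

// register keeps a lease-backed key alive and re-registers it whenever the
// keep-alive stream dies (for example after an etcd restart lets the TTL expire).
func register(ctx context.Context, cli *clientv3.Client, key, val string, ttl int64) error {
	for {
		if ctx.Err() != nil {
			return ctx.Err()
		}
		lease, err := cli.Grant(ctx, ttl)
		if err != nil {
			time.Sleep(time.Second) // assumption: simple fixed retry backoff
			continue
		}
		if _, err := cli.Put(ctx, key, val, clientv3.WithLease(lease.ID)); err != nil {
			time.Sleep(time.Second)
			continue
		}
		kaCh, err := cli.KeepAlive(ctx, lease.ID)
		if err != nil {
			time.Sleep(time.Second)
			continue
		}
		// Drain keep-alive responses; once the channel closes the lease is no
		// longer being refreshed, so loop and register again.
		for range kaCh {
		}
	}
}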

@embano1

embano1 commented Jan 18, 2021

I went through the same thing; here's my conclusion from a while ago: kubernetes/kubernetes#89488 (comment)

@stale

stale bot commented Apr 18, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 18, 2021
@stale stale bot removed the stale label Apr 19, 2021
@xiang90
Contributor

xiang90 commented Apr 19, 2021

It is a nice-to-have feature. etcd should probably check whether the server maintains leadership for long-running requests.
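
Until something like that exists server-side, a client can approximate it by polling an endpoint's Status and canceling the watch context when no leader is reported, so a reader blocked on the watch channel is released. A minimal sketch; cancelOnLeaderLoss, the endpoint argument and the poll interval are illustrative:

package watcherutil

import (
	"context"
	"time"

	"go.etcd.io/etcd/clientv3" // assumption: v3.4-era import path
)

// cancelOnLeaderLoss polls one endpoint's Status and cancels the supplied
// context when the endpoint is unreachable or reports no leader (Leader == 0).
func cancelOnLeaderLoss(ctx context.Context, cancel context.CancelFunc, cli *clientv3.Client, endpoint string) {
	ticker := time.NewTicker(5 * time.Second) // assumption: poll interval
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			sctx, scancel := context.WithTimeout(ctx, 2*time.Second)
			resp, err := cli.Status(sctx, endpoint)
			scancel()
			if err != nil || resp.Leader == 0 {
				cancel() // release anyone blocked on the watch channel
				return
			}
		}
	}
}

Canceling on any Status error is deliberately conservative; a production version would probably tolerate a few transient failures before giving up.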

@stale

stale bot commented Jul 18, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@346999452

The client code can cause this result in extreme scenarios; I have reported it as a bug:
#17611
