Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky TestRuntimeReconfigGrowClusterSize #19216

Open
ahrtr opened this issue Jan 17, 2025 · 3 comments
Open

Flaky TestRuntimeReconfigGrowClusterSize #19216

ahrtr opened this issue Jan 17, 2025 · 3 comments

Comments

@ahrtr
Copy link
Member

ahrtr commented Jan 17, 2025

Which Github Action / Prow Jobs are flaking?

/home/prow/go/src/github.com/etcd-io/etcd/bin/etcdctl (/home/prow/go/src/github.com/etcd-io/etcd/bin/etcdctl_--endpoints=http://localhost:20000,http://localhost:20005_member_promote_5d2dab3fee7266f0_-w_json) (58870): {"level":"warn","ts":"2025-01-17T04:53:28.794025Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:65","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x9e88b48/localhost:20000","method":"/etcdserverpb.Cluster/MemberPromote","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: can only promote a learner member which is in sync with leader"}
/home/prow/go/src/github.com/etcd-io/etcd/bin/etcdctl (/home/prow/go/src/github.com/etcd-io/etcd/bin/etcdctl_--endpoints=http://localhost:20000,http://localhost:20005_member_promote_5d2dab3fee7266f0_-w_json) (58870): Error: etcdserver: can only promote a learner member which is in sync with leader
    runtime_reconfiguration_test.go:181: 
        	Error Trace:	/home/prow/go/src/github.com/etcd-io/etcd/tests/e2e/runtime_reconfiguration_test.go:181
        	            				/home/prow/go/src/github.com/etcd-io/etcd/tests/e2e/runtime_reconfiguration_test.go:78
        	Error:      	Received unexpected error:
        	            	match not found.  Set EXPECT_DEBUG for more info Errs: [unexpected exit code [1] after running [/home/prow/go/src/github.com/etcd-io/etcd/bin/etcdctl --endpoints=http://localhost:20000,http://localhost:20005 member promote 5d2dab3fee7266f0 -w json]], last lines:
        	            	{"level":"warn","ts":"2025-01-17T04:53:28.794025Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:65","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x9e88b48/localhost:20000","method":"/etcdserverpb.Cluster/MemberPromote","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: can only promote a learner member which is in sync with leader"}
        	            	Error: etcdserver: can only promote a learner member which is in sync with leader
        	Test:       	TestRuntimeReconfigGrowClusterSize/grow_cluster_size_from_1_to_3_with_learner
    logger.go:146: 2025-01-17T04:53:28.799Z	INFO	closing test cluster...
    logger.go:146: 2025-01-17T04:53:28.799Z	INFO	stopping server...	{"name": "TestRuntimeReconfigGrowClusterSizegrow_cluster_size_from_1_to_3_with_learner-test-0"}

Refer to https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/etcd-io_etcd/19198/pull-etcd-e2e-386/1880111907646476288

Which tests are flaking?

.

Github Action / Prow Job link

No response

Reason for failure (if possible)

No response

Anything else we need to know?

No response

@jmao-dd
Copy link
Contributor

jmao-dd commented Jan 20, 2025

@ahrtr Please let me know if my analysis makes sense.

From the timestamps in the following logs, it seems the time between the new learner becomes ready and it is promoted is only 18ms, which may be too short for the learner to catch up.

/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd (TestRuntimeReconfigGrowClusterSizegrow_cluster_size_from_1_to_3_with_learner-test-2) (58855): {"level":"info","ts":"2025-01-17T04:53:28.776278Z","caller":"v3rpc/health.go:63","msg":"grpc service status changed","service":"","status":"SERVING"}
...
/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd (TestRuntimeReconfigGrowClusterSizegrow_cluster_size_from_1_to_3_with_learner-test-2) (58855): {"level":"info","ts":"2025-01-17T04:53:28.779439Z","caller":"etcdserver/server.go:519","msg":"initialized peer connections; fast-forwarding election ticks","local-member-id":"5d2dab3fee7266f0","forward-ticks":8,"forward-duration":"800ms","election-ticks":10,"election-timeout":"1s","active-remote-members":2}
/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd (TestRuntimeReconfigGrowClusterSizegrow_cluster_size_from_1_to_3_with_learner-test-0) (58777): {"level":"error","ts":"2025-01-17T04:53:28.793455Z","caller":"etcdserver/server.go:1582","msg":"rejecting promote learner: learner is not ready","learner-ready-percent":0,"ready-percent-threshold":0.9,"stacktrace":..."}
/home/prow/go/src/github.com/etcd-io/etcd/bin/etcdctl (/home/prow/go/src/github.com/etcd-io/etcd/bin/etcdctl_--endpoints=http://localhost:20000,http://localhost:20005_member_promote_5d2dab3fee7266f0_-w_json) (58870): {"level":"warn","ts":"2025-01-17T04:53:28.794025Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:65","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x9e88b48/localhost:20000","method":"/etcdserverpb.Cluster/MemberPromote","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: can only promote a learner member which is in sync with leader"}

In the test code theMemberPromote call is made right after epc.StartNewProc. Maybe we should add some retry on ErrLearnerNotReady with a small amount of sleep between retries?

@ahrtr
Copy link
Member Author

ahrtr commented Jan 21, 2025

Thanks @jmao-dd .

Please feel free to deliver a PR. Let's retry several times (i.e. 3) if the error contains "can only promote a learner member which is in sync with leade"

@jmao-dd
Copy link
Contributor

jmao-dd commented Jan 25, 2025

Created a PR #19272 for it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants