
Add downgrade cancellation e2e tests #19252

Merged

Conversation

@henrybear327 (Contributor) commented Jan 21, 2025

Part 2 of the e2e tests: downgrade cancellation is invoked after the cluster has been partially downgraded (see part 1: #19244).

Reference: #17976



codecov bot commented Jan 21, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.82%. Comparing base (3cc3daf) to head (4d1a207).
Report is 12 commits behind head on main.

Additional details and impacted files

see 25 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #19252      +/-   ##
==========================================
- Coverage   68.90%   68.82%   -0.08%     
==========================================
  Files         420      420              
  Lines       35706    35706              
==========================================
- Hits        24602    24576      -26     
- Misses       9683     9705      +22     
- Partials     1421     1425       +4     


@henrybear327 force-pushed the e2e/downgrade_cancellation_partial branch 3 times, most recently from 7afa927 to 9da71dc on January 28, 2025 at 12:34
@henrybear327 force-pushed the e2e/downgrade_cancellation_partial branch from 9da71dc to 6004df8 on January 28, 2025 at 13:00
@henrybear327 marked this pull request as ready for review on January 28, 2025 at 13:00
@henrybear327 (Contributor, Author)

/retest

@henrybear327 force-pushed the e2e/downgrade_cancellation_partial branch from 6004df8 to b88f908 on January 28, 2025 at 13:51
@henrybear327 (Contributor, Author)

Rebased onto main, as #19244 has been merged.

@henrybear327 (Contributor, Author)

/retest

@ahrtr (Member) commented Jan 28, 2025

@siyuanfoundation PTAL, thx

@henrybear327 force-pushed the e2e/downgrade_cancellation_partial branch 4 times, most recently from 6359e49 to 5f6fd54 on February 3, 2025 at 20:38
@henrybear327 (Contributor, Author)

The more nodes in the cluster that have been downgraded, the more likely the downgrade cancellation is to time out.

I don't have a better idea for reducing the flakiness so far, so I added this retry mechanism. It might not be elegant, but at least the CI tests should be reasonably stable.

Tested with the command `PASSES="release e2e" PKG=./tests/e2e TESTCASE="TestDowngradeCancellationAfterDowngrading4InClusterOf5" TIMEOUT=40m ./scripts/test.sh -v -count=50 -failfast`, as this test usually times out on my machine every 5 runs or so.
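For readers unfamiliar with this pattern, the kind of retry wrapper described above usually looks like the minimal sketch below. This is only an illustration assuming a simple fixed-backoff policy; the helper name and signature are hypothetical and are not the code added in this PR.

```go
// Hypothetical bounded-retry helper (illustration only, not the PR's code).
package retryexample

import (
	"fmt"
	"time"
)

// retry runs op up to attempts times, waiting backoff between tries.
// It returns nil on the first success, or the last error otherwise.
func retry(attempts int, backoff time.Duration, op func() error) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if lastErr = op(); lastErr == nil {
			return nil
		}
		time.Sleep(backoff)
	}
	return fmt.Errorf("operation failed after %d attempts: %w", attempts, lastErr)
}
```

In an e2e test, the flaky step (here, waiting for the downgrade cancellation to take effect) would be wrapped in such a helper so that a single slow run does not fail the whole test.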

@henrybear327 (Contributor, Author)

/retest

1 similar comment
@henrybear327 (Contributor, Author)

/retest

tests/framework/e2e/downgrade.go: two outdated review threads (resolved)
@henrybear327 force-pushed the e2e/downgrade_cancellation_partial branch 4 times, most recently from 73e07fd to b49c70b on February 4, 2025 at 09:43
@henrybear327 (Contributor, Author)

/retest

@siyuanfoundation (Contributor)

/retest

1 similar comment
@henrybear327 (Contributor, Author)

/retest

@henrybear327 force-pushed the e2e/downgrade_cancellation_partial branch 2 times, most recently from 36f4722 to a15bcbd on February 4, 2025 at 17:20
Signed-off-by: Chun-Hung Tseng <[email protected]>
Signed-off-by: Siyuan Zhang <[email protected]>
@henrybear327 force-pushed the e2e/downgrade_cancellation_partial branch from a15bcbd to 4d1a207 on February 4, 2025 at 17:31
@ahrtr (Member) left a comment

Thanks for working on this. Overall looks good to me.

After the downgrade is cancelled, users should be able to restore the cluster to its original state. They just need to replace the binary with the previous version for each downgraded member. I think we should cover it as well. Of course, it can be addressed in follow-up PRs.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, henrybear327, siyuanfoundation


@henrybear327 (Contributor, Author)

> Thanks for working on this. Overall looks good to me.
>
> After the downgrade is cancelled, users should be able to restore the cluster to its original state. They just need to replace the binary with the previous version for each downgraded member. I think we should cover it as well. Of course, it can be addressed in follow-up PRs.

I can quickly work on this.

Could you elaborate on "replace the binary with the previous version for each downgraded member"?

Thank you @ahrtr!

@ahrtr (Member) commented Feb 4, 2025

> Could you elaborate on "replace the binary with the previous version for each downgraded member"?

For example, take a 3-member cluster running 3.6. You downgraded 2 of the members to 3.5, then cancelled the downgrade. The cluster version should be 3.5.

The recovery process is straightforward: just replace the binary for the 2 downgraded members, one by one:

  • stop the member;
  • replace the binary (with the 3.6 binary in this example);
  • start the member.

When it's done, the cluster version should be 3.6, and the cluster should be able to serve client requests correctly.
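To make the recovery steps above concrete, here is a minimal sketch in Go. It is only an illustration of the member-by-member flow under the assumption of a generic Member abstraction; the interface and helper below are hypothetical and do not reflect the actual e2e framework API.

```go
// Illustration only: hypothetical types, not the etcd e2e framework API.
package recoveryexample

// Member abstracts the operations needed to swap an etcd member's binary.
type Member interface {
	Stop() error
	SetBinaryPath(path string) error // point the member at a different etcd binary
	Start() error
}

// restoreMembers walks the previously downgraded members one by one:
// stop the member, switch it back to the original binary, restart it.
func restoreMembers(downgraded []Member, originalBinary string) error {
	for _, m := range downgraded {
		if err := m.Stop(); err != nil {
			return err
		}
		if err := m.SetBinaryPath(originalBinary); err != nil {
			return err
		}
		if err := m.Start(); err != nil {
			return err
		}
	}
	return nil
}
```

After the loop completes, a test would assert that the reported cluster version is back to the original and that the cluster still serves client requests correctly.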

@henrybear327 (Contributor, Author)

> > Could you elaborate on "replace the binary with the previous version for each downgraded member"?
>
> For example, take a 3-member cluster running 3.6. You downgraded 2 of the members to 3.5, then cancelled the downgrade. The cluster version should be 3.5.
>
> The recovery process is straightforward: just replace the binary for the 2 downgraded members, one by one:
>
> • stop the member;
> • replace the binary (with the 3.6 binary in this example);
> • start the member.
>
> When it's done, the cluster version should be 3.6, and the cluster should be able to serve client requests correctly.

Maybe I am mistaken. Isn't this what we are doing in DowngradeUpgradeMembers now? :)

@ahrtr (Member) commented Feb 4, 2025

You can reuse DowngradeUpgradeMembers. The goal is to extend the e2e test case to verify that users are able to recover the partially downgraded cluster to its original state.

@ahrtr merged commit 3842a5a into etcd-io:main on February 4, 2025
36 checks passed
@henrybear327 deleted the e2e/downgrade_cancellation_partial branch on February 4, 2025 at 19:51
@henrybear327 (Contributor, Author) commented Feb 4, 2025

> You can reuse DowngradeUpgradeMembers. The goal is to extend the e2e test case to verify that users are able to recover the partially downgraded cluster to its original state.

I will submit a follow-up PR and see if it meets your expectations! :)
