Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Flaky Tests] Update/revert kindest/node to v1.32.0 #11532

Merged
merged 4 commits into from
Mar 3, 2025

Conversation

LucaBernstein
Copy link
Member

@LucaBernstein LucaBernstein commented Feb 26, 2025

How to categorize this PR?

/area testing
/kind flake

What this PR does / why we need it:
Fix the flakes introduced with #11510.
Due to a racy locking behavior in go-cni, which has been included in containerd v2.0.2, upon the cni restart, our tests can get stuck and flake.

The image kindest/node v1.32.0 used in this PR is based on kind v0.25.0 includes both still-supported versions for runc (v1.1.13) and containerd (v1.7.18).

Which issue(s) this PR fixes:
Flakes:

Special notes for your reviewer:
/cc @ScheererJ

Release note:

NONE

@gardener-prow gardener-prow bot added area/testing Testing related kind/flake Tracking or fixing a flaky test cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. labels Feb 26, 2025
@gardener-prow gardener-prow bot requested review from acumino and plkokanov February 26, 2025 15:17
@gardener-prow gardener-prow bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 26, 2025
@marc1404
Copy link
Member

🌧️ e2e-kind-upgrade failed with:

 - garden:statefulset/etcd: Unschedulable: 0/1 nodes available: 1 node is not ready
    - garden:pod/etcd-0: Unschedulable: 0/1 nodes available: 1 node is not ready
1/1 deployment(s) failed

@gardener-prow gardener-prow bot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 27, 2025
@LucaBernstein LucaBernstein changed the title [Flaky Tests] Remove custom runc override [Flaky Tests] Update runc override to v1.2.5 Feb 27, 2025
@LucaBernstein
Copy link
Member Author

/test all

@LucaBernstein LucaBernstein changed the title [Flaky Tests] Update runc override to v1.2.5 [Flaky Tests] Update containerd to include fixed go-cni version Feb 28, 2025
@ScheererJ
Copy link
Member

/assign

@LucaBernstein
Copy link
Member Author

/test all

@LucaBernstein
Copy link
Member Author

pull-gardener-e2e-kind-migration-ha-single-zone fails because of gardener/etcd-backup-restore#847.

Copy link
Member

@marc1404 marc1404 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great find! 😮 🦅

Just asking to be sure: Do we need to apply the containerd override for the cluster node(s) spawned through kind-up.sh as well?

gardener/hack/kind-up.sh

Lines 364 to 384 in 934b267

# workarounds for KinD issues
# TODO(marc1404): Remove once kindest/node uses runc >= v1.2.4
if [[ -n "${CI:-}" ]]; then
cp /get-runc/runc "$(dirname "$0")/../pkg/provider-local/node/runc"
else
"$(dirname "$0")/../pkg/provider-local/node/get-runc.sh"
fi
for node in $nodes; do
# workaround https://kind.sigs.k8s.io/docs/user/known-issues/#pod-errors-due-to-too-many-open-files
docker exec "$node" sh -c "sysctl fs.inotify.max_user_instances=8192"
# TODO(marc1404): Remove once kindest/node uses runc >= v1.2.4
# workaround issue with runc v1.2.3 provided by kindest/node:v1.32.0 by installing runc v1.2.4 manually (https://github.com/opencontainers/runc/pull/4555)
if [ -n "${CI:-}" ]; then
echo "Installing runc on node $node from container filesystem"
docker cp /get-runc/runc "$node":/usr/local/sbin/runc
else
docker cp "$(dirname "$0")/../pkg/provider-local/node/runc" "$node":/usr/local/sbin/runc
fi
done

As I understand it the issue with go-cni seems to affect our tests primarily.

@LucaBernstein
Copy link
Member Author

/test all
to increase certainty in stability

@LucaBernstein
Copy link
Member Author

Just asking to be sure: Do we need to apply the containerd override for the cluster node(s) spawned through kind-up.sh as well?

In that case we would need to pre-build the image, as it's only referenced there.
As it affects only local setups and tests, imho we could go with the current patch alone and wait for a updated release soon.

Copy link
Member

@rfranzke rfranzke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this light, do we need all this at all, i.e., can't we go back v1.31 and wait for a proper patch for v1.32 to be released, before we upgrade to it again? This way, we don't need to introduce these workarounds? Or is there something in v1.32 that we desperately need now?

@gardener-prow gardener-prow bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 28, 2025
@LucaBernstein LucaBernstein force-pushed the try-no-override-runc branch 2 times, most recently from 247431d to 7772fbe Compare February 28, 2025 13:13
@LucaBernstein
Copy link
Member Author

Also addresse @marc1404's concern and override containerd in kind node, as local failure(s) have also been observed there by now.

In this light, do we need all this at all, i.e., can't we go back v1.31 and wait for a proper patch for v1.32 to be released, before we upgrade to it again? This way, we don't need to introduce these workarounds? Or is there something in v1.32 that we desperately need now?

The proposed changes make the problem + fix more explicit. I personally would vote for this "forward-fix".

@rfranzke
Copy link
Member

I'm a fan of "fix-forward" if there is a real problem :) However, if there is nothing we gain out of v1.32 right now, I don't quite understand why we spend all this time to fix a problem that we wouldn't had if we just waited for the next release. Now we have workarounds and TODOs in our code and gain basically nothing out of it. 😕
Given that it's too late now, I'm OK with merging this PR, of course...

This includes the still supported versions for runc (v1.1.13) and containerd (v1.7.18).
@LucaBernstein LucaBernstein changed the title [Flaky Tests] Update containerd to include fixed go-cni version [Flaky Tests] Update/revert kindest/node to v1.32.0 Feb 28, 2025
@LucaBernstein
Copy link
Member Author

After another round of discussion we concluded that it might be best to revert/update to kindest/node v1.32.0 and get rid of both the runc and containerd version overrides.

@LucaBernstein
Copy link
Member Author

working as intended:

runc version 1.1.13
commit: v1.1.13-0-g58aa920
spec: 1.0.2-dev
go: go1.22.9
libseccomp: 2.5.4
containerd github.com/containerd/containerd v1.7.18 ae71819c4f5e67bb4d5ae76a6b735f29cc25774e

Copy link
Member

@rfranzke rfranzke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@rfranzke rfranzke requested a review from marc1404 March 3, 2025 07:34
@gardener-prow gardener-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 3, 2025
Copy link
Member

@marc1404 marc1404 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

Thank you! After a lot of back and forth I'm glad we can remove this workaround again 😅

I've added a commit to, temporarily, require a Renovate dashboard approval for patch updates of kindest/node. I'll take care of checking future kindest/node releases on whether they ship with the correct version of runc and containerd in order to remove this restriction again.
2221f70 (#11532)

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Mar 3, 2025
Copy link
Contributor

gardener-prow bot commented Mar 3, 2025

LGTM label has been added.

Git tree hash: dd7c93a89e65a457eebe3fd4692dbe7f539f2d82

Copy link
Contributor

gardener-prow bot commented Mar 3, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: marc1404, rfranzke

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/testing Testing related cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/flake Tracking or fixing a flaky test lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants