
Fix kubernetes_daemon_set not waiting for rollout. Fixes #2092 #2419

Open · wants to merge 2 commits into main
Conversation

@sbocinec sbocinec commented Feb 9, 2024

Description

Fixes the kubernetes_daemon_set_v1 resource so that it waits for rollout. The current check when wait_for_rollout = true is ineffective and does not wait for the rollout at all, because it checks the wrong status field of the DaemonSet.

❗ What is worse, there is currently no way to correctly manage a DaemonSet using the Kubernetes provider, as kubernetes_manifest is also affected by a nasty bug, so every DaemonSet change ends up in a failure.

Impacted by the issue, I prepared a reproducer TF configuration here. While troubleshooting by applying a new resource and changing an existing one, I noticed the current code checks the wrong field, so wait_for_rollout = true has no effect.

Additionally, the majority of the DaemonSet resource tests had to be fixed: after fixing the rollout check they started failing, because the original test manifests never rolled out successfully.

⚠️ This fix might break many existing DaemonSet configurations, as wait_for_rollout defaults to true in the resource. Until now, waiting for rollout was ineffective; after the fix, apply will wait until all DaemonSet pods are in the Ready state. I expect this might cause issues for users of this resource, as their existing TF code might start to time out on apply. This is also visible in the number of test cases that had to be updated.

Troubleshooting - apply of a new resource

Plan: 1 to add, 0 to change, 0 to destroy.                                                                                                                                                                                                                                                                                                                              

kubernetes_daemon_set_v1.i-am-not-waiting-for-rollout: Creating...                                                                                                                   
kubernetes_daemon_set_v1.i-am-not-waiting-for-rollout: Creation complete after 2s [id=default/i-am-not-waiting-for-rollout]                                                          
                                                                                          
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.      

The apply is almost instant. While the apply is running, this is what the DaemonSet status looks like:

$ while true; do kubectl get ds/i-am-not-waiting-for-rollout --output="jsonpath={.status}"; echo; done
Error from server (NotFound): daemonsets.apps "i-am-not-waiting-for-rollout" not found
...                                                                                          
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberMisscheduled":0,"numberReady":0,"numberUnavailable":2,"observedGeneration":1,"updatedNumberScheduled":2}
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberMisscheduled":0,"numberReady":0,"numberUnavailable":2,"observedGeneration":1,"updatedNumberScheduled":2}
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberAvailable":1,"numberMisscheduled":0,"numberReady":1,"numberUnavailable":1,"observedGeneration":1,"updatedNumberScheduled":2}
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberAvailable":2,"numberMisscheduled":0,"numberReady":2,"observedGeneration":1,"updatedNumberScheduled":2}
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberAvailable":2,"numberMisscheduled":0,"numberReady":2,"observedGeneration":1,"updatedNumberScheduled":2}

As you can see, currentNumberScheduled is not the right property to check when we need to wait for the rollout, as this number does not represent the DaemonSet rollout status. We must check numberReady here.
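The difference between the two checks can be sketched in Go. This is an illustration only, not the provider's actual code: the struct is a local copy of the relevant apps/v1 DaemonSetStatus fields, and both function names are made up for this example.

```go
package main

import "fmt"

// DaemonSetStatus mirrors the relevant fields of the Kubernetes
// apps/v1 DaemonSetStatus type (a local copy for illustration).
type DaemonSetStatus struct {
	DesiredNumberScheduled int32
	CurrentNumberScheduled int32
	NumberReady            int32
}

// buggyRolloutComplete shows the old, ineffective check: pods are
// scheduled almost immediately, so this returns true long before
// the pods are actually Ready.
func buggyRolloutComplete(s DaemonSetStatus) bool {
	return s.CurrentNumberScheduled == s.DesiredNumberScheduled
}

// rolloutComplete sketches the corrected check: the rollout is done
// only when every desired pod reports Ready, not merely scheduled.
func rolloutComplete(s DaemonSetStatus) bool {
	return s.NumberReady == s.DesiredNumberScheduled
}

func main() {
	// Status captured right after apply (see the watch output above):
	// all pods scheduled, none Ready yet.
	s := DaemonSetStatus{DesiredNumberScheduled: 2, CurrentNumberScheduled: 2, NumberReady: 0}
	fmt.Println(buggyRolloutComplete(s)) // true  - old check passes too early
	fmt.Println(rolloutComplete(s))      // false - corrected check keeps waiting
}
```

Run against the first status line of the watch output above, the old check already reports success while the corrected check keeps waiting.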

Troubleshooting - change of an existing resource

Apply of a change:

Plan: 0 to add, 1 to change, 0 to destroy.                                                                                                                                           
                                                                                         
kubernetes_daemon_set_v1.i-am-not-waiting-for-rollout: Modifying... [id=default/i-am-not-waiting-for-rollout]                                                                        
kubernetes_daemon_set_v1.i-am-not-waiting-for-rollout: Modifications complete after 2s [id=default/i-am-not-waiting-for-rollout]                                                                                                                                                                                                                                           
                                                                                          
Apply complete! Resources: 0 added, 1 changed, 0 destroyed.                                                                          

DS status output:

{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberAvailable":2,"numberMisscheduled":0,"numberReady":2,"observedGeneration":5,"updatedNumberScheduled":2}
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberAvailable":1,"numberMisscheduled":0,"numberReady":1,"numberUnavailable":1,"observedGeneration":6,"updatedNumberScheduled":1}
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberAvailable":1,"numberMisscheduled":0,"numberReady":1,"numberUnavailable":1,"observedGeneration":6,"updatedNumberScheduled":1}
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberAvailable":1,"numberMisscheduled":0,"numberReady":1,"numberUnavailable":1,"observedGeneration":6,"updatedNumberScheduled":1}
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberAvailable":1,"numberMisscheduled":0,"numberReady":1,"numberUnavailable":1,"observedGeneration":6,"updatedNumberScheduled":2}
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberAvailable":1,"numberMisscheduled":0,"numberReady":1,"numberUnavailable":1,"observedGeneration":6,"updatedNumberScheduled":2}
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberAvailable":2,"numberMisscheduled":0,"numberReady":2,"observedGeneration":6,"updatedNumberScheduled":2}
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberAvailable":2,"numberMisscheduled":0,"numberReady":2,"observedGeneration":6,"updatedNumberScheduled":2}
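The update case shows why numberReady alone is not sufficient either: the first status line still reports all pods Ready for the old generation 5 before the controller observes generation 6. A complete check for updates also has to look at observedGeneration and updatedNumberScheduled. A minimal sketch (again a local struct and an invented function name, not the provider's actual code):

```go
package main

import "fmt"

// Relevant apps/v1 DaemonSetStatus fields, copied locally for illustration.
type DaemonSetStatus struct {
	ObservedGeneration     int64
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberReady            int32
}

// updateRolledOut sketches what an update-case check has to cover: the
// controller must have observed the new generation, every node must run
// the updated pod template, and all pods must be Ready.
func updateRolledOut(s DaemonSetStatus, generation int64) bool {
	return s.ObservedGeneration >= generation &&
		s.UpdatedNumberScheduled == s.DesiredNumberScheduled &&
		s.NumberReady == s.DesiredNumberScheduled
}

func main() {
	// First line of the watch output above: the old pods are still all
	// Ready, but the controller has not yet observed generation 6.
	before := DaemonSetStatus{ObservedGeneration: 5, DesiredNumberScheduled: 2, UpdatedNumberScheduled: 2, NumberReady: 2}
	// Last line: generation 6 observed, all pods updated and Ready.
	after := DaemonSetStatus{ObservedGeneration: 6, DesiredNumberScheduled: 2, UpdatedNumberScheduled: 2, NumberReady: 2}
	fmt.Println(updateRolledOut(before, 6)) // false
	fmt.Println(updateRolledOut(after, 6))  // true
}
```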

Acceptance tests

  • Have you added an acceptance test for the functionality being added?
  • Have you run the acceptance tests on this branch?

Output from acceptance testing:

$ make testacc TESTARGS='-run=TestAccKubernetesDaemonSetV1* -v' 
==> Checking that code complies with gofmt requirements...
go vet ./...
TF_ACC=1 go test "/home/stano/workspace/projects/tf/terraform-provider-kubernetes/kubernetes" -v -vet=off -run=TestAccKubernetesDaemonSetV1* -v -parallel 8 -timeout 3h
=== RUN   TestAccKubernetesDaemonSetV1_minimal
=== PAUSE TestAccKubernetesDaemonSetV1_minimal
=== RUN   TestAccKubernetesDaemonSetV1_basic
=== PAUSE TestAccKubernetesDaemonSetV1_basic
=== RUN   TestAccKubernetesDaemonSetV1_with_template_metadata
=== PAUSE TestAccKubernetesDaemonSetV1_with_template_metadata
=== RUN   TestAccKubernetesDaemonSetV1_initContainer
=== PAUSE TestAccKubernetesDaemonSetV1_initContainer
=== RUN   TestAccKubernetesDaemonSetV1_noTopLevelLabels
=== PAUSE TestAccKubernetesDaemonSetV1_noTopLevelLabels
=== RUN   TestAccKubernetesDaemonSetV1_with_tolerations
=== PAUSE TestAccKubernetesDaemonSetV1_with_tolerations
=== RUN   TestAccKubernetesDaemonSetV1_with_tolerations_unset_toleration_seconds
=== PAUSE TestAccKubernetesDaemonSetV1_with_tolerations_unset_toleration_seconds
=== RUN   TestAccKubernetesDaemonSetV1_with_container_security_context_seccomp_profile
=== PAUSE TestAccKubernetesDaemonSetV1_with_container_security_context_seccomp_profile
=== RUN   TestAccKubernetesDaemonSetV1_with_container_security_context_seccomp_localhost_profile
=== PAUSE TestAccKubernetesDaemonSetV1_with_container_security_context_seccomp_localhost_profile
=== RUN   TestAccKubernetesDaemonSetV1_with_resource_requirements
=== PAUSE TestAccKubernetesDaemonSetV1_with_resource_requirements
=== RUN   TestAccKubernetesDaemonSetV1_minimalWithTemplateNamespace
=== PAUSE TestAccKubernetesDaemonSetV1_minimalWithTemplateNamespace
=== CONT  TestAccKubernetesDaemonSetV1_minimal
=== CONT  TestAccKubernetesDaemonSetV1_with_tolerations_unset_toleration_seconds
=== CONT  TestAccKubernetesDaemonSetV1_noTopLevelLabels
=== CONT  TestAccKubernetesDaemonSetV1_with_template_metadata
=== CONT  TestAccKubernetesDaemonSetV1_with_resource_requirements
=== CONT  TestAccKubernetesDaemonSetV1_minimalWithTemplateNamespace
=== CONT  TestAccKubernetesDaemonSetV1_initContainer
=== CONT  TestAccKubernetesDaemonSetV1_with_tolerations
--- PASS: TestAccKubernetesDaemonSetV1_with_tolerations_unset_toleration_seconds (18.46s)
=== CONT  TestAccKubernetesDaemonSetV1_basic
--- PASS: TestAccKubernetesDaemonSetV1_minimal (18.49s)
=== CONT  TestAccKubernetesDaemonSetV1_with_container_security_context_seccomp_localhost_profile
--- PASS: TestAccKubernetesDaemonSetV1_noTopLevelLabels (18.54s)
=== CONT  TestAccKubernetesDaemonSetV1_with_container_security_context_seccomp_profile
--- PASS: TestAccKubernetesDaemonSetV1_with_tolerations (19.22s)
--- PASS: TestAccKubernetesDaemonSetV1_initContainer (22.81s)
--- PASS: TestAccKubernetesDaemonSetV1_with_template_metadata (31.53s)
--- PASS: TestAccKubernetesDaemonSetV1_minimalWithTemplateNamespace (32.90s)
--- PASS: TestAccKubernetesDaemonSetV1_with_container_security_context_seccomp_localhost_profile (14.42s)
--- PASS: TestAccKubernetesDaemonSetV1_with_container_security_context_seccomp_profile (23.03s)
--- PASS: TestAccKubernetesDaemonSetV1_with_resource_requirements (46.52s)
--- PASS: TestAccKubernetesDaemonSetV1_basic (28.36s)
PASS
ok  	github.com/hashicorp/terraform-provider-kubernetes/kubernetes	46.870s

Release Note

Release note for CHANGELOG:

`resource/kubernetes_daemon_set_v1`: fix an issue with the provider not waiting for rollout even when `wait_for_rollout` is set to `true`

References

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment


resources {
  limits = {
    cpu          = "0.5"
    memory       = "512Mi"
    "nvidia/gpu" = "1"
Contributor Author:

note: this causes the pods to be unschedulable, so the rollout fails and the test never passes.

@@ -912,6 +916,7 @@ func testAccKubernetesDaemonSetV1ConfigWithContainerSecurityContextSeccompProfil
}
}
}
wait_for_rollout = false
Contributor Author:

note: rollout of this resource fails on K8s nodes that do not have any seccomp profiles locally available (e.g. nodes in a kind cluster), so let's not wait for rollout, as it's not important for this test.

@@ -962,6 +967,7 @@ func testAccKubernetesDaemonSetV1ConfigWithContainerSecurityContextSeccompProfil
}
}
}
wait_for_rollout = false
Contributor Author:

note: rollout of this resource fails on K8s nodes that do not have any seccomp profiles locally available (e.g. nodes in a kind cluster), so let's not wait for rollout, as it's not important for this test.

@@ -685,6 +685,7 @@ func testAccKubernetesDaemonSetV1ConfigWithTemplateMetadata(depName, imageName s
container {
  image   = "%s"
  name    = "containername"
  command = ["sleep", "infinity"]
Contributor Author:

note: the container command must be added because by default the container just exits, causing the test to fail. This is now added for the majority of the test cases for this resource.

@github-actions github-actions bot added size/M and removed size/S labels Feb 19, 2024
@sbocinec sbocinec force-pushed the ds-fix-wait-for-rollout branch 3 times, most recently from 8321d84 to f4cbd9b Compare February 19, 2024 15:34
@alexsomesan (Member) left a comment:

@sbocinec thank you for contributing this fix and the very detailed diagnosis of the issue.

The changes look good to me. While we wait for the tests to run, would you mind expanding the change log entry a bit to describe the change in behaviour that users can expect after upgrading?

@@ -0,0 +1,4 @@
```release-note:bug
`resource/kubernetes_daemon_set_v1`: fix an issue with the provider not waiting for rollout with `wait_for_rollout = true`.
Member:

I would suggest a mention of the change in behaviour here, so users are well advised of what to expect when upgrading.

Contributor Author:

@alexsomesan That's a great point. I'm not sure about the wording of the note. I'm thinking about adding the following to the existing entry: "Note: As the wait_for_rollout is true by default, users might experience the apply operation of the existing code taking longer."

Or should the UPGRADE NOTES section be used instead, as I see it was used in the past, e.g. https://github.com/hashicorp/terraform-provider-kubernetes/blob/main/CHANGELOG.md#161-april-18-2019? In that case, I'm not sure how to add this extra section.

This fix ensures the resource correctly waits for the DaemonSet rollout
by using logic similar to kubernetes_deployment. Additionally,
waitForDaemonSetReplicasFunc is renamed to waitForDaemonSetPodsFunc to
correctly describe the operation (there are no replicas in a DaemonSet).
Successfully merging this pull request may close these issues:

Daemonset wait_for_rollout does not appear to have any impact.