🐛 fix: properly restart cloud-init #5116

Open: wants to merge 1 commit into base: main


faiq
Contributor

@faiq faiq commented Sep 5, 2024

What type of PR is this?
/kind fix

What this PR does / why we need it:

It seems that systemctl restart cloud-init is no longer sufficient to start the cloud-init process again with the secret userdata. After some googling, I found the following steps to restart it without needing to reboot the machine. The steps come from https://cloudinit.readthedocs.io/en/latest/howto/rerun_cloud_init.html

We should move towards an approach that doesn't require a restart of cloud-init to handle secret userdata.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5115

Special notes for your reviewer:

Checklist:

  • squashed commits
  • includes documentation
  • includes emojis
  • adds unit tests
  • adds or updates e2e tests

Release note:

fix: uses different commands to restart cloud-init

@k8s-ci-robot
Contributor

@faiq: The label(s) kind/fix cannot be applied, because the repository doesn't have them.

@k8s-ci-robot added the release-note and cncf-cla: yes labels Sep 5, 2024
@k8s-ci-robot added the needs-priority and size/S labels Sep 5, 2024
@faiq changed the title from "fix: properly restart cloud-init" to "🐛 fix: properly restart cloud-init" Sep 5, 2024
@faiq
Contributor Author

faiq commented Sep 5, 2024

/test ?

@k8s-ci-robot
Contributor

@faiq: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-aws-build
  • /test pull-cluster-api-provider-aws-build-docker
  • /test pull-cluster-api-provider-aws-test
  • /test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-aws-apidiff-main
  • /test pull-cluster-api-provider-aws-e2e
  • /test pull-cluster-api-provider-aws-e2e-blocking
  • /test pull-cluster-api-provider-aws-e2e-clusterclass
  • /test pull-cluster-api-provider-aws-e2e-conformance
  • /test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-aws-e2e-eks
  • /test pull-cluster-api-provider-aws-e2e-eks-gc
  • /test pull-cluster-api-provider-aws-e2e-eks-testing

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-aws-apidiff-main
  • pull-cluster-api-provider-aws-build
  • pull-cluster-api-provider-aws-build-docker
  • pull-cluster-api-provider-aws-test
  • pull-cluster-api-provider-aws-verify

@faiq
Contributor Author

faiq commented Sep 5, 2024

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

@faiq
Contributor Author

faiq commented Sep 5, 2024

/test pull-cluster-api-provider-aws-e2e

@faiq
Contributor Author

faiq commented Sep 5, 2024

/retest

@faiq
Contributor Author

faiq commented Sep 5, 2024

/test pull-cluster-api-provider-aws-e2e-eks

@k8s-ci-robot added the size/XS label and removed the size/S label Sep 5, 2024
@faiq
Contributor Author

faiq commented Sep 5, 2024

/retest

@faiq
Contributor Author

faiq commented Sep 5, 2024

/test pull-cluster-api-provider-aws-e2e

@faiq
Contributor Author

faiq commented Sep 5, 2024

/retest

@SriRamanujam SriRamanujam left a comment

It didn't work :(

It definitely rebooted the instance at the expected time, but apparently cloud-init doesn't like it when /var/lib/cloud/instance hangs around.

Let me test a couple of potential options and see which one works.

Sep 05 20:41:24 ip-10-80-81-103 cloud-init[587]: +++ [2024-09-05T20:41:24+00:00] appending data to temporary file /etc/secret-userdata.txt.gz
Sep 05 20:41:24 ip-10-80-81-103 cloud-init[587]: +++ [2024-09-05T20:41:24+00:00] getting userdata from AWS Secrets Manager
Sep 05 20:41:24 ip-10-80-81-103 cloud-init[587]: +++ [2024-09-05T20:41:24+00:00] getting secret value from AWS Secrets Manager
Sep 05 20:41:25 ip-10-80-81-103 cloud-init[587]: +++ [2024-09-05T20:41:25+00:00] AWS CLI reported successful execution for SecretsManager::GetSecretValue
Sep 05 20:41:25 ip-10-80-81-103 cloud-init[587]: +++ [2024-09-05T20:41:25+00:00] appending data to temporary file /etc/secret-userdata.txt.gz
Sep 05 20:41:25 ip-10-80-81-103 cloud-init[587]: +++ [2024-09-05T20:41:25+00:00] deleting secret from AWS Secrets Manager
Sep 05 20:41:26 ip-10-80-81-103 cloud-init[587]: +++ [2024-09-05T20:41:26+00:00] AWS CLI reported successful execution for SecretsManager::DeleteSecret
Sep 05 20:41:26 ip-10-80-81-103 cloud-init[587]: +++ [2024-09-05T20:41:26+00:00] deleting secret from AWS Secrets Manager
Sep 05 20:41:27 ip-10-80-81-103 cloud-init[587]: +++ [2024-09-05T20:41:27+00:00] AWS CLI reported successful execution for SecretsManager::DeleteSecret
Sep 05 20:41:27 ip-10-80-81-103 cloud-init[587]: +++ [2024-09-05T20:41:27+00:00] decompressing userdata to /etc/secret-userdata.txt
Sep 05 20:41:27 ip-10-80-81-103 cloud-init[587]: +++ [2024-09-05T20:41:27+00:00] restarting cloud-init
Sep 05 20:41:28 ip-10-80-81-103 cloud-init[587]: Failed to connect to bus: No such file or directory
Sep 05 20:41:31 ip-10-80-81-103 passwd[1309]: password for 'ubuntu' changed by 'root'
Sep 05 20:41:32 ip-10-80-81-103 systemd[1]: cloud-init.service: Main process exited, code=exited, status=120/n/a
Sep 05 20:41:32 ip-10-80-81-103 systemd[1]: cloud-init.service: Failed with result 'exit-code'.
Sep 05 20:41:32 ip-10-80-81-103 systemd[1]: Stopped cloud-init.service - Cloud-init: Network Stage.
Sep 05 20:41:32 ip-10-80-81-103 systemd[1]: cloud-init.service: Consumed 6.430s CPU time.
-- Boot ff2115d67d254a9ca580ea0d70ae67b1 --
Sep 05 20:41:51 ip-10-80-81-103 systemd[1]: Starting cloud-init.service - Cloud-init: Network Stage...
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: Cloud-init v. 24.2-0ubuntu1~24.04.2 running 'init' at Thu, 05 Sep 2024 20:41:52 +0000. Up 9.45 seconds.
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: ++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: | Device |  Up  |           Address           |      Mask     | Scope  |     Hw-Address    |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: |  ens5  | True |         10.80.81.103        | 255.255.254.0 | global | 0e:74:35:1c:fa:fd |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: |  ens5  | True | fe80::c74:35ff:fe1c:fafd/64 |       .       |  link  | 0e:74:35:1c:fa:fd |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: |   lo   | True |          127.0.0.1          |   255.0.0.0   |  host  |         .         |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: +++++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++++++
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: +-------+-------------+------------+-----------------+-----------+-------+
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: | Route | Destination |  Gateway   |     Genmask     | Interface | Flags |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: +-------+-------------+------------+-----------------+-----------+-------+
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: |   0   |   0.0.0.0   | 10.80.80.1 |     0.0.0.0     |    ens5   |   UG  |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: |   1   |  10.80.80.0 |  0.0.0.0   |  255.255.254.0  |    ens5   |   U   |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: |   2   |  10.80.80.1 |  0.0.0.0   | 255.255.255.255 |    ens5   |   UH  |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: |   3   |  10.80.80.2 |  0.0.0.0   | 255.255.255.255 |    ens5   |   UH  |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: +-------+-------------+------------+-----------------+-----------+-------+
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: +-------+-------------+---------+-----------+-------+
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: | Route | Destination | Gateway | Interface | Flags |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: +-------+-------------+---------+-----------+-------+
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: |   0   |  fe80::/64  |    ::   |    ens5   |   U   |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: |   1   |    local    |    ::   |    ens5   |   U   |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: |   2   |  multicast  |    ::   |    ens5   |   U   |
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ci-info: +-------+-------------+---------+-----------+-------+
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: 2024-09-05 20:41:52,520 - main.py[ERROR]: failed stage init
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: Traceback (most recent call last):
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:   File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 797, in status_wrapper
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:     ret = functor(name, args)
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:           ^^^^^^^^^^^^^^^^^^^
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:   File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 436, in main_init
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:     iid = init.instancify()
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:           ^^^^^^^^^^^^^^^^^
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:   File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 541, in instancify
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:     return self._reflect_cur_instance()
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:   File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 461, in _reflect_cur_instance
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:     util.del_file(self.paths.instance_link)
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:   File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2069, in del_file
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:     os.unlink(path)
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: IsADirectoryError: [Errno 21] Is a directory: '/var/lib/cloud/instance'
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: failed run of stage init
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ------------------------------------------------------------
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: Traceback (most recent call last):
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:   File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 797, in status_wrapper
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:     ret = functor(name, args)
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:           ^^^^^^^^^^^^^^^^^^^
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:   File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 436, in main_init
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:     iid = init.instancify()
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:           ^^^^^^^^^^^^^^^^^
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:   File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 541, in instancify
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:     return self._reflect_cur_instance()
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:   File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 461, in _reflect_cur_instance
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:     util.del_file(self.paths.instance_link)
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:   File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2069, in del_file
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]:     os.unlink(path)
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: IsADirectoryError: [Errno 21] Is a directory: '/var/lib/cloud/instance'
Sep 05 20:41:52 ip-10-80-81-103 cloud-init[702]: ------------------------------------------------------------
Sep 05 20:41:52 ip-10-80-81-103 systemd[1]: cloud-init.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Sep 05 20:41:52 ip-10-80-81-103 systemd[1]: cloud-init.service: Failed with result 'exit-code'.
Sep 05 20:41:52 ip-10-80-81-103 systemd[1]: Failed to start cloud-init.service - Cloud-init: Network Stage.
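
The traceback above shows cloud-init calling os.unlink on /var/lib/cloud/instance (its instance_link path) and failing because the path is a real directory rather than the symlink it normally maintains. A minimal sketch for checking which state a node is in follows; the helper name instance_link_state is hypothetical and is not part of cloud-init or of this PR.

```shell
#!/bin/sh
# Hypothetical helper: report how a path exists on disk.
# Prints one of: symlink, directory, file, missing.
instance_link_state() {
    path="$1"
    # -L must be tested first, because -d follows symlinks.
    if [ -L "$path" ]; then
        echo "symlink"      # what cloud-init expects for /var/lib/cloud/instance
    elif [ -d "$path" ]; then
        echo "directory"    # the state that produces IsADirectoryError in the log
    elif [ -e "$path" ]; then
        echo "file"
    else
        echo "missing"      # nothing left for cloud-init to unlink
    fi
}

# Inspect the default location, or a path given as the first argument.
instance_link_state "${1:-/var/lib/cloud/instance}"
```

If this prints "directory" on an affected node, that matches the failure in the log; it does not by itself explain how the directory was created.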

@faiq
Contributor Author

faiq commented Sep 5, 2024

/test pull-cluster-api-provider-aws-e2e

@SriRamanujam SriRamanujam left a comment

After trying everything and getting nowhere, I tried just rebooting the machine out of desperation. It worked. So maybe we just do this.

Comment on lines +234 to +235
rm -rf /var/lib/cloud/instances
cloud-init clean --reboot

Suggested change
- rm -rf /var/lib/cloud/instances
- cloud-init clean --reboot
+ reboot

Copy link

We tried the following, which worked without a reboot:

rm -rf /var/lib/cloud/instances
cloud-init clean
systemctl restart cloud-init-local
systemctl restart cloud-init
systemctl restart cloud-config
systemctl restart cloud-final
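
The sequence above can be wrapped so that it stops at the first failing step and can be previewed before it touches a node. This is only a sketch of the commands quoted in this comment, not the script in the PR; the DRY_RUN variable and the run and rerun_cloud_init function names are hypothetical. Other participants in this thread report that this approach did not work on their images, so treat it as something to test rather than a confirmed fix.

```shell
#!/bin/sh
# Sketch only: re-run the cloud-init stages in boot order without a reboot.
# DRY_RUN (hypothetical) defaults to 1, so commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@" || return 1
    fi
}

rerun_cloud_init() {
    run rm -rf /var/lib/cloud/instances || return 1
    run cloud-init clean || return 1
    # Unit names exactly as listed in the comment, in boot order.
    for unit in cloud-init-local cloud-init cloud-config cloud-final; do
        run systemctl restart "$unit" || return 1
    done
}

rerun_cloud_init
```

Setting DRY_RUN=0 as root would execute the commands for real; the default only prints them.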

Comment on lines +199 to +200
rm -rf /var/lib/cloud/instances
cloud-init clean --reboot

Suggested change
- rm -rf /var/lib/cloud/instances
- cloud-init clean --reboot
+ reboot

Contributor Author

I'm going to try again with some updates I made to failing tests #5118

Contributor Author

@SriRamanujam the tests are all passing - mind building and trying again locally?

Copy link

We also hit this issue and found that a reboot is not needed after cloud-init clean. We just needed to restart all cloud-init services in order.

Suggested change
- rm -rf /var/lib/cloud/instances
- cloud-init clean --reboot
+ rm -rf /var/lib/cloud/instances
+ cloud-init clean
+ systemctl restart cloud-init-local
+ systemctl restart cloud-init
+ systemctl restart cloud-config
+ systemctl restart cloud-final

@faiq
Contributor Author

faiq commented Sep 5, 2024

/retest

@faiq
Contributor Author

faiq commented Sep 5, 2024

/retest

@faiq
Contributor Author

faiq commented Sep 9, 2024

/retest

@faiq
Contributor Author

faiq commented Sep 9, 2024

/test pull-cluster-api-provider-aws-e2e

@faiq
Contributor Author

faiq commented Sep 10, 2024

/retest

@faiq
Contributor Author

faiq commented Sep 10, 2024

/retest

@faiq
Contributor Author

faiq commented Sep 10, 2024

The e2e tests seem to pass. Does anyone have suggestions on what other tests I should run?

@richardcase
Member

/test ?

@k8s-ci-robot
Contributor

@richardcase: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-aws-build
  • /test pull-cluster-api-provider-aws-build-docker
  • /test pull-cluster-api-provider-aws-test
  • /test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-aws-apidiff-main
  • /test pull-cluster-api-provider-aws-e2e
  • /test pull-cluster-api-provider-aws-e2e-blocking
  • /test pull-cluster-api-provider-aws-e2e-clusterclass
  • /test pull-cluster-api-provider-aws-e2e-conformance
  • /test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-aws-e2e-eks
  • /test pull-cluster-api-provider-aws-e2e-eks-gc
  • /test pull-cluster-api-provider-aws-e2e-eks-testing

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-aws-apidiff-main
  • pull-cluster-api-provider-aws-build
  • pull-cluster-api-provider-aws-build-docker
  • pull-cluster-api-provider-aws-test
  • pull-cluster-api-provider-aws-verify

@richardcase
Member

Let's also run the eks e2e just in case:

/test pull-cluster-api-provider-aws-e2e-eks

@richardcase
Member

Until the eks e2e passes:

/hold

I think this looks good to me. @faiq would you be able to add a note on any manual testing you have done with this?

@k8s-ci-robot added the do-not-merge/hold label Sep 30, 2024
@richardcase
Member

Both the non-eks and eks e2e tests are passing with this change.

@richardcase
Member

/cherrypick release-2.6

@k8s-infra-cherrypick-robot

@richardcase: once the present PR merges, I will cherry-pick it on top of release-2.6 in a new PR and assign it to you.

@richardcase
Member

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: richardcase

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Oct 4, 2024
@SriRamanujam

I ran another test, using each of the three variants discussed above.

tl;dr As before, rebooting is the only thing that works for me.

Test notes

Kubernetes: 1.29.6
cloud-init version: 24.2-0ubuntu1~24.04.2
Ubuntu version: 24.04 Noble

Procedure: build and deploy the PR branch to a k8s cluster, then attempt to stand up a CAPI cluster, wait for the first control plane machine to come up, and then SSH into it and see what's happening. For the variants that aren't in Git, I manually edited secretsmanager/secret_fetch_script.go and ssm/secret_fetch_script.go.

Expectation: the control plane comes up and the node goes ready with no intervention.

cloud-init clean --reboot 🔴 NOT WORKING

It rebooted, but on the subsequent run, it failed like so:

Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]: 2024-10-05 15:43:03,114 - main.py[ERROR]: failed stage init-local
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]: Traceback (most recent call last):
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:   File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 797, in status_wrapper
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:     ret = functor(name, args)
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:           ^^^^^^^^^^^^^^^^^^^
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:   File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 402, in main_init
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:     init.fetch(existing=existing)
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:   File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 538, in fetch
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:     return self._get_data_source(existing=existing)
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:   File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 397, in _get_data_source
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:     util.del_file(self.paths.instance_link)
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:   File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2069, in del_file
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:     os.unlink(path)
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]: IsADirectoryError: [Errno 21] Is a directory: '/var/lib/cloud/instance'
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]: failed run of stage init-local
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]: ------------------------------------------------------------
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]: Traceback (most recent call last):
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:   File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 797, in status_wrapper
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:     ret = functor(name, args)
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:           ^^^^^^^^^^^^^^^^^^^
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:   File "/usr/lib/python3/dist-packages/cloudinit/cmd/main.py", line 402, in main_init
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:     init.fetch(existing=existing)
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:   File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 538, in fetch
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:     return self._get_data_source(existing=existing)
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:   File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 397, in _get_data_source
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:     util.del_file(self.paths.instance_link)
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:   File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2069, in del_file
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]:     os.unlink(path)
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]: IsADirectoryError: [Errno 21] Is a directory: '/var/lib/cloud/instance'
Oct 05 15:43:03 ip-10-80-80-56 cloud-init[671]: ------------------------------------------------------------
Oct 05 15:43:03 ip-10-80-80-56 systemd[1]: cloud-init-local.service: Main process exited, code=exited, status=1/FAILURE
Oct 05 15:43:03 ip-10-80-80-56 systemd[1]: cloud-init-local.service: Failed with result 'exit-code'.
Oct 05 15:43:03 ip-10-80-80-56 systemd[1]: Failed to start cloud-init-local.service - Cloud-init: Local Stage (pre-network).
k get machine -A

NAMESPACE   NAME                       CLUSTER   NODENAME   PROVIDERID                       PHASE         AGE    VERSION
test        test-control-plane-7w986   test                 aws:///us-east-1a/i-<redacted>   Provisioned   9m3s   v1.29.6

reboot 🟢 WORKING

Nodes come up without intervention. Everything works as expected.

test   test-control-plane-99f2n   test   ip-<redacted>.ec2.internal    aws:///us-east-1b/i-<redacted>   Running       3m51s   v1.29.6
test   test-control-plane-mkfgg   test                                 aws:///us-east-1c/i-<redacted>   Provisioned   83s     v1.29.6
test   test-control-plane-zq8js   test   ip-<redacted>.ec2.internal    aws:///us-east-1a/i-<redacted>   Running       7m20s   v1.29.6

restarting each cloud-init stage via systemctl 🔴 NOT WORKING

cloud-init did get restarted from scratch, but it did not run the part-002 boothook to actually bootstrap the system.

+++ [2024-10-05T16:22:42+00:00] restarting cloud-init
Cloud-init v. 24.2-0ubuntu1~24.04.2 running 'init-local' at Sat, 05 Oct 2024 16:22:43 +0000. Up 44.49 seconds.
Cloud-init v. 24.2-0ubuntu1~24.04.2 running 'init' at Sat, 05 Oct 2024 16:22:48 +0000. Up 49.37 seconds.
ci-info: ++++++++++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++++++++++
ci-info: +--------+------+----------------------------+---------------+--------+-------------------+
ci-info: | Device |  Up  |          Address           |      Mask     | Scope  |     Hw-Address    |
ci-info: +--------+------+----------------------------+---------------+--------+-------------------+
ci-info: |  ens5  | True |        10.80.83.240        | 255.255.254.0 | global | 02:bd:e8:07:62:a5 |
ci-info: |  ens5  | True | fe80::bd:e8ff:fe07:62a5/64 |       .       |  link  | 02:bd:e8:07:62:a5 |
ci-info: |   lo   | True |         127.0.0.1          |   255.0.0.0   |  host  |         .         |
ci-info: +--------+------+----------------------------+---------------+--------+-------------------+
ci-info: +++++++++++++++++++++++++++++Route IPv4 info++++++++++++++++++++++++++++++
ci-info: +-------+-------------+------------+-----------------+-----------+-------+
ci-info: | Route | Destination |  Gateway   |     Genmask     | Interface | Flags |
ci-info: +-------+-------------+------------+-----------------+-----------+-------+
ci-info: |   0   |   0.0.0.0   | 10.80.82.1 |     0.0.0.0     |    ens5   |   UG  |
ci-info: |   1   |  10.80.80.2 | 10.80.82.1 | 255.255.255.255 |    ens5   |  UGH  |
ci-info: |   2   |  10.80.82.0 |  0.0.0.0   |  255.255.254.0  |    ens5   |   U   |
ci-info: |   3   |  10.80.82.1 |  0.0.0.0   | 255.255.255.255 |    ens5   |   UH  |
ci-info: +-------+-------------+------------+-----------------+-----------+-------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: |   0   |  fe80::/64  |    ::   |    ens5   |   U   |
ci-info: |   1   |    local    |    ::   |    ens5   |   U   |
ci-info: |   2   |  multicast  |    ::   |    ens5   |   U   |
ci-info: +-------+-------------+---------+-----------+-------+
+++ [2024-10-05T16:22:48+00:00] aws.cluster.x-k8s.io encrypted cloud-init script /var/lib/cloud/instances/i-0babdb1f5d5781229/boothooks/part-001 started
+++ [2024-10-05T16:22:48+00:00] secret prefix: aws.cluster.x-k8s.io/325ed06d-db7f-4239-b66c-af0736b1a764
+++ [2024-10-05T16:22:48+00:00] secret count: 2
+++ [2024-10-05T16:22:48+00:00] encrypted userdata already written to disk
+++ [2024-10-05T16:22:48+00:00] aws.cluster.x-k8s.io encrypted cloud-init script /var/lib/cloud/instances/i-0babdb1f5d5781229/boothooks/part-001 finished
Cloud-init v. 24.2-0ubuntu1~24.04.2 running 'modules:config' at Sat, 05 Oct 2024 16:22:49 +0000. Up 50.37 seconds.
Cloud-init v. 24.2-0ubuntu1~24.04.2 running 'modules:final' at Sat, 05 Oct 2024 16:23:04 +0000. Up 64.95 seconds.
Cloud-init v. 24.2-0ubuntu1~24.04.2 finished at Sat, 05 Oct 2024 16:23:04 +0000. Datasource DataSourceEc2Local.  Up 65.11 seconds

@Nalum

Nalum commented Oct 8, 2024

@SriRamanujam what AMI are you using to test? Is it public?

@phoban01 and I have tested these two AMIs with the 3 listed methods, and all of them work:

AMI                     CLOUD-INIT VERSION
ami-096aecdfffea2cc06   24.3.1-0ubuntu0~22.04.1
ami-04e0f2fcf7c7c4550   24.3.1-0ubuntu0~24.04.2

These two are the new CAPA AMIs located in us-west-2.

We did receive a warning to run systemctl daemon-reload when we ran systemctl restart cloud-init-local.

@SriRamanujam

@Nalum - I'm using an internally built AMI that's based on Ubuntu AMI ami-026d46db925b5c23b, which is a daily build from Aug 27 that has cloud-init 24.2 installed out of the box.

@richardcase @faiq - If others are seeing success and the CI is passing, please don't block this on me. I think there are enough confounding variables in my specific case that there's probably something else going on.

@faiq
Contributor Author

faiq commented Oct 8, 2024

I think we're safe to merge this.

Thank you @SriRamanujam and @Nalum for verifying the solution!

@richardcase
Member

I must be doing something wrong (or perhaps there is an issue with the AMIs I'm using), but none of the proposed solutions are working for me 😢

@holmanb

holmanb commented Oct 9, 2024

Hello @faiq @SriRamanujam @Nalum @zarcen @richardcase / all

Cloud-init developer here, I'd like to try to help if possible.

I don't think that restarting cloud-init or rebooting the instance is desirable, nor should it be necessary. See my reasoning below.

How things currently work

If I understand correctly, the goal of the bootscript is:

Get user-data from the secret store, then make cloud-init run it.

As implemented, this script:

  1. uses the AWS CLI to get the user-data
  2. restarts cloud-init in order to force cloud-init to see the "secret" userdata

A little bit about cloud-init

Restarting cloud-init before it is completed may have unexpected consequences. Similarly, restarting the instance may break things in unexpected ways (and is obviously slower than a non-reboot solution).

Cloud-init has a code concept called "datasources". These classes define where user-data comes from. This is what makes it possible to run the same cloud-init package on EC2 and on other clouds. From my perspective, a custom datasource would be the preferred solution. The current boothook script attempts to do the same thing that a cloud-init datasource does, but in a way that assumes things about cloud-init that it probably shouldn't.
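
To make the datasource idea concrete, here is a toy Python model. It is not the proof of concept mentioned in this thread and it does not use cloud-init's real classes; `ToyDataSource`, `SecretStoreDataSource`, `first_available`, and the `fetch_secret` callable are hypothetical names for illustration only. The point is that the datasource itself fetches the user-data and hands it over, so nothing needs to be restarted.

```python
from typing import Callable, Optional, Sequence


class ToyDataSource:
    """Stand-in for a datasource: something that knows where user-data lives."""

    def __init__(self) -> None:
        self.userdata_raw: Optional[bytes] = None

    def get_data(self) -> bool:
        """Return True if this source found user-data, False otherwise."""
        raise NotImplementedError


class SecretStoreDataSource(ToyDataSource):
    """Hypothetical source that reads user-data from a secret store."""

    def __init__(self, fetch_secret: Callable[[], bytes]) -> None:
        super().__init__()
        self._fetch_secret = fetch_secret

    def get_data(self) -> bool:
        try:
            self.userdata_raw = self._fetch_secret()
        except OSError:
            # Secret store unreachable: report "not found" instead of crashing.
            return False
        return True


def first_available(sources: Sequence[ToyDataSource]) -> Optional[ToyDataSource]:
    """Pick the first source that reports data (a stand-in for discovery)."""
    for source in sources:
        if source.get_data():
            return source
    return None
```

In this model, supporting a new place to read user-data from means adding one more class, while the code that consumes `userdata_raw` stays unchanged.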

I wrote a hackish proof of concept of a custom datasource which appears to work from my limited testing. I wouldn't propose it as-is, but perhaps we can use it as a straw-man to find a more robust and maintainable path forward. The commit message explains how it works and how to install it. Please let me know if you have any questions.

Some questions

How is the AWS CLI installed?
Are special images used which include it?
Would installing a configuration alongside a drop-in datasource in these images be a reasonable solution? (see the commit message for more details)

@faiq
Contributor Author

faiq commented Oct 10, 2024

Wow! That sounds amazing. Restarting cloud-init is definitely hack-y, and I'm glad you came up with something that doesn't require it.

How is the AWS CLI installed?
Are special images used which include it?
Would installing a configuration alongside a drop-in datasource in these images be a reasonable solution? (see the commit message for more details)

The AWS CLI is installed through image-builder, which is also what we use to build the images.

code linked here: https://github.com/kubernetes-sigs/image-builder/blob/2f188e738f961730645269fe942cfcbb0925db7a/images/capi/ansible/roles/providers/tasks/awscliv2.yml#L76

@SriRamanujam

@holmanb This is super interesting, thanks for writing that up!

@dlipovetsky this may be relevant to our interests re: getting rid of cloud-init hackiness entirely

@Nalum

Nalum commented Oct 10, 2024

That's great, thanks for the info and example @holmanb 👍

@richardcase
Member

Fantastic, thank you @holmanb ❤️ Having the sample is excellent, I will give it a go today.

@chrischdi - you may like this as well, especially based on your suggestion yesterday after the CAPI meeting.

@richardcase
Member

@holmanb - to add to @faiq's response to the questions:

  • Are special images used which include it?
    Yes, we build AWS AMIs specifically for the provider (using image-builder). We are currently being forced to rebuild our AMIs, so there is a good window to get changes made to the spec of the image.

  • Would installing a configuration alongside a drop-in datasource in these images be a reasonable solution? (see the commit message for more details)
    Sounds like a great solution to me. With image-builder we can include a custom DS and make the config changes only when we are building AMIs (i.e. so it doesn't apply to any other CAPI image format).

@richardcase
Member

@faiq @SriRamanujam - I'm going to build a custom AMI with the DS and config in it for testing this morning. I will post the AMI ID if you want to try it.

@holmanb

holmanb commented Oct 10, 2024

Thanks for the feedback @richardcase @Nalum @SriRamanujam @faiq!

Based on @faiq and @richardcase's responses, it sounds like including a drop-in cloud-init datasource should work.

@richardcase Thanks for testing! I don't know what exactly the image build process includes (I didn't read all of the ansible bits linked), but I just want to note that for cloud-init to consider it a "first boot", you'll need to run cloud-init clean (probably with the --logs flag to clean up any old logs). If you have any questions about setup / testing, feel free to stop by #cloud-init on Libera (just make sure to stay in the channel, you might not get a response immediately).

@richardcase
Member

richardcase commented Oct 10, 2024

@richardcase Thanks for testing! I don't know what exactly the image build process includes (I didn't read all of the ansible bits linked), but I just want to note that for cloud-init to consider it a "first boot", you'll need to run cloud-init clean (probably with the --logs flag to clean up any old logs). If you have any questions about setup / testing, feel free to stop by #cloud-init on Libera (just make sure to stay in the channel, you might not get a response immediately).

Thanks @holmanb. We created a test AMI with these changes and have been testing them with CAPA from this PR's branch. The image-builder changes so far are on this WIP PR: kubernetes-sigs/image-builder#1583.

We're running into an error when the boothook script runs in the local stage: the network and AWS creds are not set up yet, so the AWS CLI calls fail. However, with this CAPA branch there is still a reboot, so when the machine comes up the second time the boothook script runs (as the network and creds are set up) and we get k8s coming up. Not ideal, but it's working, a lot further than any previous attempt 😄

Tomorrow we'll look at changing the boothook script logic (and maybe the "local" datasource) to handle this situation better and remove the reboot.

I'll start staying logged into IRC... I shut it down last night. Maybe time to look at quassel or something similar.

Thanks again @holmanb 🙇‍♂️

@holmanb

holmanb commented Oct 10, 2024

@richardcase Thanks for testing! I don't know what exactly the image build process includes (I didn't read all of the ansible bits linked), but I just want to note that for cloud-init to consider it a "first boot", you'll need to run cloud-init clean (probably with the --logs flag to clean up any old logs). If you have any questions about setup / testing, feel free to stop by #cloud-init on Libera (just make sure to stay in the channel, you might not get a response immediately).

Thanks @holmanb. We created a test AMI with these changes and have been testing them with CAPA from this PR's branch. The image-builder changes so far are on this WIP PR: kubernetes-sigs/image-builder#1583.

We're running into an error when the boothook script runs in the local stage: the network and AWS creds are not set up yet, so the AWS CLI calls fail. However, with this CAPA branch there is still a reboot, so when the machine comes up the second time the boothook script runs (as the network and creds are set up) and we get k8s coming up. Not ideal, but it's working, a lot further than any previous attempt 😄

Right, boothooks normally run in the network stage. I saw that failure during testing, but I didn't look too hard at the warning since it succeeded when it tried again during the network stage. It should be trivial to disable the attempt during the local stage. I commented on your PR with a suggestion.
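
The idea of skipping the attempt during the local stage can be sketched as a simple dependency check. This is a toy Python model, not cloud-init's code: the stage names mirror the ones mentioned above, while `STAGE_PROVIDES`, `Source`, `runnable_in`, and `secret_source` are hypothetical names, and what each stage provides is assumed for illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet

# What each boot stage can offer (assumed for illustration).
STAGE_PROVIDES: Dict[str, FrozenSet[str]] = {
    "local": frozenset({"filesystem"}),
    "network": frozenset({"filesystem", "network"}),
}


@dataclass(frozen=True)
class Source:
    """A user-data source together with the things it needs in order to run."""

    name: str
    requires: FrozenSet[str]


def runnable_in(stage: str, source: Source) -> bool:
    """A source is attempted only when the stage satisfies all of its needs."""
    return source.requires <= STAGE_PROVIDES[stage]


# A source that must reach a remote secret store needs the network.
secret_source = Source("secret-store", frozenset({"filesystem", "network"}))
```

With this declaration, `runnable_in("local", secret_source)` is false and `runnable_in("network", secret_source)` is true, so the fetch is never tried before the network and credentials exist.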

Tomorrow we'll look at changing the boothook script logic (and maybe the "local" datasource) to handle this situation better and remove the reboot.

I'll start staying logged into IRC... I shut it down last night. Maybe time to look at quassel or something similar.

I saw your comment, but not until after you had left. We do have channel logs but typically don't bother replying if the person asking has already left. FWIW I run quassel-core on a cheap cloud instance and I'm happy with it. I can connect using quassel-client from any computer (the Android app isn't bad either) without losing history.

Thanks again @holmanb 🙇‍♂️

Happy to help!

Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
needs-priority
release-note: Denotes a PR that will be considered when it comes time to generate release notes.
size/XS: Denotes a PR that changes 0-9 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

cloud-init boothook logic broken with cloud-init 24.2
9 participants