Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache Node Images List in facts #11092

Open
Payback159 opened this issue Apr 16, 2024 · 11 comments
Open

Cache Node Images List in facts #11092

Payback159 opened this issue Apr 16, 2024 · 11 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@Payback159
Copy link
Contributor

Payback159 commented Apr 16, 2024

What would you like to be added

I would like to cache the list of node images through following extension:

git diff Output of roles/download/tasks/check_pull_required.yml

-- name: Check_pull_required |  Generate a list of information about the images on a node  # noqa command-instead-of-shell - image_info_command contains a pipe, therefore requiring shell
+- name: Check_pull_required |  Generate a list of information about the images on a node # noqa command-instead-of-shell - image_info_command contains a pipe, therefore requiring shell
   shell: "{{ image_info_command }}"
-  register: docker_images
+  register: node_images
   changed_when: false
   check_mode: no
-  when: not download_always_pull
+  when:
+    - not download_always_pull
+    - node_container_images is undefined
+
+- name: Check_pull_required | Set fact node_container_images to the list of images on the node
+  set_fact:
+    node_container_images: "{{ node_images.stdout }}"
+  when:
+    - not download_always_pull
+    - node_container_images is undefined
 
 - name: Check_pull_required | Set pull_required if the desired image is not yet loaded
   set_fact:
     pull_required: >-
-      {%- if image_reponame | regex_replace('^docker\.io/(library/)?', '') in docker_images.stdout.split(',') %}false{%- else -%}true{%- endif -%}
+      {%- if image_reponame | regex_replace('^docker\.io/(library/)?', '') in node_container_images.split(',') %}false{%- else -%}true{%- endif -%}
   when: not download_always_pull
 
 - name: Check_pull_required | Check that the local digest sha256 corresponds to the given image tag
   assert:
-    that: "{{ download.repo }}:{{ download.tag }} in docker_images.stdout.split(',')"
+    that: "{{ download.repo }}:{{ download.tag }} in node_container_images.split(',') }}"
   when:
     - not download_always_pull
     - not pull_required

Why is this needed

I have several Kubernetes clusters with 60-90 nodes.

From this size, I notice that the upgrade-cluster.yml keeps aborting with the following error message and sometimes the upgrades would no longer work at all without separation with node limits. However, the error message does not always appear on the same node, but is randomly distributed across several nodes.

"Check_pull_required | Generate a list of information about the images on a node"
fatal: [dev-k8s-node-nf-32]: FAILED! => {"changed": false, "module_stderr": "", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": -13}

I have therefore checked how often and why the task is called.

In my case the task is called 20 times, but for my understanding it would be enough if the task is only called once for the whole run and the result can be cached. I don't think that the list changes during the Kubespray upgrade run in such a way that the changes would be relevant for the Kubespray run. Therefore, from my point of view, it makes no sense to fetch the list every time I have to download another image.

That means 20 x "node count" this error can occur and also executing 20 x the{{ image_info_command }} against the container runtime.

I'm currently testing the fix, but I wanted to get the opinions of the community in advance.

@Payback159 Payback159 added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 16, 2024
@Payback159
Copy link
Contributor Author

Looks like it's an bug in OpenSSH and I am not the only one with the problem. ansible/ansible#81777

@VannTen
Copy link
Contributor

VannTen commented Apr 22, 2024

I don't think we would need to use a fact, a registered variable is available during the rest of the playbook runs, so we can just use the registered variable directly and avoid setting a fact.

@Payback159
Copy link
Contributor Author

Payback159 commented Apr 22, 2024

Hello @VannTen ,

thanks for your feedback. At first I also thought that it is not necessary to set a fact and as a first approach I just tried to extend the task with the condition

  ...
  when:
   ...
    - docker_images is undefined

The task as a whole example then looked like this:

# The image_info_command depends on the Container Runtime and will output something like the following:
# nginx:1.15,gcr.io/google-containers/kube-proxy:v1.14.1,gcr.io/google-containers/kube-proxy@sha256:44af2833c6cbd9a7fc2e9d2f5244a39dfd2e31ad91bf9d4b7d810678db738ee9,gcr.io/google-containers/kube-apiserver:v1.14.1,etc...
- name: Check_pull_required |  Generate a list of information about the images on a node  # noqa command-instead-of-shell - image_info_command contains a pipe, therefore requiring shell
- name: Check_pull_required |  Generate a list of information about the images on a node # noqa command-instead-of-shell - image_info_command contains a pipe, therefore requiring shell
  shell: "{{ image_info_command }}"
  register: docker_images
  changed_when: false
  check_mode: no
  when: not download_always_pull
  when:
    - not download_always_pull
    - docker_images is undefined

But this always leads me to the problem that in the first run (first image, not ansible run), the variable is set correctly. But the second run/image Ansible throws an error that stdout isn't defined.

When the tasks look like that:

Short note: Here I have replaced docker_images with node_images, but it shouldn't really make a difference, I just thought if I change something here I could rename it to the generic name, since the task doesn't only work with docker images.

---
# The image_info_command depends on the Container Runtime and will output something like the following:
# nginx:1.15,gcr.io/google-containers/kube-proxy:v1.14.1,gcr.io/google-containers/kube-proxy@sha256:44af2833c6cbd9a7fc2e9d2f5244a39dfd2e31ad91bf9d4b7d810678db738ee9,gcr.io/google-containers/kube-apiserver:v1.14.1,etc...
- name: Check_pull_required |  Generate a list of information about the images on a node # noqa command-instead-of-shell - image_info_command contains a pipe, therefore requiring shell
  shell: "{{ image_info_command }}"
  register: node_images
  changed_when: false
  check_mode: no
  when:
    - not download_always_pull
    - node_images is undefined

- name: Check_pull_required | Set pull_required if the desired image is not yet loaded
  set_fact:
    pull_required: >-
      {%- if image_reponame | regex_replace('^docker\.io/(library/)?', '') in node_images.stdout.split(',') %}false{%- else -%}true{%- endif -%}
  when:
    - not download_always_pull

- name: Check_pull_required | Check that the local digest sha256 corresponds to the given image tag
  assert:
    that: "{{ download.repo }}:{{ download.tag }} in node_images.stdout.split(',') }}"
  when:
    - not download_always_pull
    - not pull_required
    - pull_by_digest
  tags:
    - asserts

I am getting following error

# First Run/image is behaving like expected
TASK [download : Check_pull_required |  Generate a list of information about the images on a node] ***
task path: /kubespray/roles/download/tasks/check_pull_required.yml:4
ok: [kubespray-2-24-1-50-beb0cf006-k8s-master-nf-1] => {"changed": false, "cmd": "/usr/local/bin/nerdctl -n k8s.io images --format '{{ .Repository }}:{{ .Tag }}' 2>/dev/null | grep -v ^:$ | tr '\n' ','", "delta": "0:00:00.026868", "end": "2024-04-22 11:34:16.897943", "msg": "", "rc": 0, "start": "2024-04-22 11:34:16.871075", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
ok: [kubespray-2-24-1-50-beb0cf006-k8s-master-nf-2] => {"changed": false, "cmd": "/usr/local/bin/nerdctl -n k8s.io images --format '{{ .Repository }}:{{ .Tag }}' 2>/dev/null | grep -v ^:$ | tr '\n' ','", "delta": "0:00:00.027118", "end": "2024-04-22 11:34:16.923514", "msg": "", "rc": 0, "start": "2024-04-22 11:34:16.896396", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
ok: [kubespray-2-24-1-50-beb0cf006-k8s-node-nf-1] => {"changed": false, "cmd": "/usr/local/bin/nerdctl -n k8s.io images --format '{{ .Repository }}:{{ .Tag }}' 2>/dev/null | grep -v ^:$ | tr '\n' ','", "delta": "0:00:00.028809", "end": "2024-04-22 11:34:16.976791", "msg": "", "rc": 0, "start": "2024-04-22 11:34:16.947982", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
ok: [kubespray-2-24-1-50-beb0cf006-k8s-master-nf-3] => {"changed": false, "cmd": "/usr/local/bin/nerdctl -n k8s.io images --format '{{ .Repository }}:{{ .Tag }}' 2>/dev/null | grep -v ^:$ | tr '\n' ','", "delta": "0:00:00.026440", "end": "2024-04-22 11:34:16.973733", "msg": "", "rc": 0, "start": "2024-04-22 11:34:16.947293", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
ok: [kubespray-2-24-1-50-beb0cf006-k8s-node-nf-2] => {"changed": false, "cmd": "/usr/local/bin/nerdctl -n k8s.io images --format '{{ .Repository }}:{{ .Tag }}' 2>/dev/null | grep -v ^:$ | tr '\n' ','", "delta": "0:00:00.028615", "end": "2024-04-22 11:34:17.031499", "msg": "", "rc": 0, "start": "2024-04-22 11:34:17.002884", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
Monday 22 April 2024  11:33:34 +0000 (0:00:00.374)       0:05:45.887 ********** 

TASK [download : Check_pull_required | Set pull_required if the desired image is not yet loaded] ***
task path: /kubespray/roles/download/tasks/check_pull_required.yml:20
ok: [kubespray-2-24-1-50-beb0cf006-k8s-master-nf-1] => {"ansible_facts": {"pull_required": true}, "changed": false}
ok: [kubespray-2-24-1-50-beb0cf006-k8s-master-nf-2] => {"ansible_facts": {"pull_required": true}, "changed": false}
ok: [kubespray-2-24-1-50-beb0cf006-k8s-master-nf-3] => {"ansible_facts": {"pull_required": true}, "changed": false}
ok: [kubespray-2-24-1-50-beb0cf006-k8s-node-nf-1] => {"ansible_facts": {"pull_required": true}, "changed": false}
ok: [kubespray-2-24-1-50-beb0cf006-k8s-node-nf-2] => {"ansible_facts": {"pull_required": true}, "changed": false}
Monday 22 April 2024  11:33:35 +0000 (0:00:00.190)       0:05:46.078 ********** 
Monday 22 April 2024  11:33:35 +0000 (0:00:00.142)       0:05:46.220 **********

...

# Second image fails

TASK [download : Download_container | Prepare container download] **************
task path: /kubespray/roles/download/tasks/download_container.yml:18
included: /kubespray/roles/download/tasks/check_pull_required.yml for kubespray-2-24-1-50-beb0cf006-k8s-master-nf-1, kubespray-2-24-1-50-beb0cf006-k8s-master-nf-2, kubespray-2-24-1-50-beb0cf006-k8s-master-nf-3, kubespray-2-24-1-50-beb0cf006-k8s-node-nf-1, kubespray-2-24-1-50-beb0cf006-k8s-node-nf-2
Monday 22 April 2024  11:33:49 +0000 (0:00:00.240)       0:06:00.507 ********** 
Monday 22 April 2024  11:33:49 +0000 (0:00:00.151)       0:06:00.659 ********** 

TASK [download : Check_pull_required | Set pull_required if the desired image is not yet loaded] ***
task path: /kubespray/roles/download/tasks/check_pull_required.yml:20
fatal: [kubespray-2-24-1-50-beb0cf006-k8s-master-nf-1]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'stdout'. 'dict object' has no attribute 'stdout'\n\nThe error appears to be in '/kubespray/roles/download/tasks/check_pull_required.yml': line 20, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Check_pull_required | Set pull_required if the desired image is not yet loaded\n  ^ here\n"}
fatal: [kubespray-2-24-1-50-beb0cf006-k8s-master-nf-2]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'stdout'. 'dict object' has no attribute 'stdout'\n\nThe error appears to be in '/kubespray/roles/download/tasks/check_pull_required.yml': line 20, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Check_pull_required | Set pull_required if the desired image is not yet loaded\n  ^ here\n"}
fatal: [kubespray-2-24-1-50-beb0cf006-k8s-master-nf-3]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'stdout'. 'dict object' has no attribute 'stdout'\n\nThe error appears to be in '/kubespray/roles/download/tasks/check_pull_required.yml': line 20, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Check_pull_required | Set pull_required if the desired image is not yet loaded\n  ^ here\n"}
fatal: [kubespray-2-24-1-50-beb0cf006-k8s-node-nf-1]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'stdout'. 'dict object' has no attribute 'stdout'\n\nThe error appears to be in '/kubespray/roles/download/tasks/check_pull_required.yml': line 20, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Check_pull_required | Set pull_required if the desired image is not yet loaded\n  ^ here\n"}
fatal: [kubespray-2-24-1-50-beb0cf006-k8s-node-nf-2]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'stdout'. 'dict object' has no attribute 'stdout'\n\nThe error appears to be in '/kubespray/roles/download/tasks/check_pull_required.yml': line 20, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Check_pull_required | Set pull_required if the desired image is not yet loaded\n  ^ here\n"}

If I interpret the error message correctly, it is because the variable node_images does not remain set, contrary to the assumption.

Unfortunately I can't explain 100% why it doesn't stay set (since I originally made the same assumption as you) but maybe you can think of something else.

Thank you and kind regards! :-)

@VannTen
Copy link
Contributor

VannTen commented Apr 22, 2024 via email

@Payback159
Copy link
Contributor Author

Hello @VannTen ,

thanks for the hint, I didn't even think of that. I have now added a debug step and in the 2nd run (where it breaks) I get the following message:

TASK [download : Debug | Print node_images.stdout] *****************************
task path: /kubespray/roles/download/tasks/check_pull_required.yml:13
ok: [kubespray-2-24-1-53-d9473e943-k8s-master-nf-1] => {
    "node_images.stdout": "VARIABLE IS NOT DEFINED!: 'dict object' has no attribute 'stdout'. 'dict object' has no attribute 'stdout'"
}
ok: [kubespray-2-24-1-53-d9473e943-k8s-master-nf-2] => {
    "node_images.stdout": "VARIABLE IS NOT DEFINED!: 'dict object' has no attribute 'stdout'. 'dict object' has no attribute 'stdout'"
}
ok: [kubespray-2-24-1-53-d9473e943-k8s-master-nf-3] => {
    "node_images.stdout": "VARIABLE IS NOT DEFINED!: 'dict object' has no attribute 'stdout'. 'dict object' has no attribute 'stdout'"
}
ok: [kubespray-2-24-1-53-d9473e943-k8s-node-nf-1] => {
    "node_images.stdout": "VARIABLE IS NOT DEFINED!: 'dict object' has no attribute 'stdout'. 'dict object' has no attribute 'stdout'"
}
ok: [kubespray-2-24-1-53-d9473e943-k8s-node-nf-2] => {
    "node_images.stdout": "VARIABLE IS NOT DEFINED!: 'dict object' has no attribute 'stdout'. 'dict object' has no attribute 'stdout'"
}

It seems the node_images variable is not preserved in the whole ansible run, because the file check_pull_required.yml are included dynamically via include_tasks in the download_container.yml file?

I have now dragged the tasks for creating the node image lists into download_container.yml and passed node_images.stdout as variable to check_pull_required.yml. This way node_images is preserved throughout the entire Ansible run and I can reuse the list for all images.

download_container.yml

...
    # The image_info_command depends on the Container Runtime and will output something like the following:
    # nginx:1.15,gcr.io/google-containers/kube-proxy:v1.14.1,gcr.io/google-containers/kube-proxy@sha256:44af2833c6cbd9a7fc2e9d2f5244a39dfd2e31ad91bf9d4b7d810678db738ee9,gcr.io/google-containers/kube-apiserver:v1.14.1,etc...
    - name: Download_container |  Generate a list of information about the images on a node # noqa command-instead-of-shell - image_info_command contains a pipe, therefore requiring shell
      shell: "{{ image_info_command }}"
      register: node_images
      changed_when: false
      check_mode: no
      when:
        - not download_always_pull
        - node_images is undefined

    - name: Download_container | Prepare container download
      include_tasks: check_pull_required.yml
      when:
        - not download_always_pull
      vars:
        node_images_stdout: "{{ node_images.stdout }}"
...

check_pull_required.yml

---
- name: Check_pull_required | Set pull_required if the desired image is not yet loaded
  set_fact:
    pull_required: >-
      {%- if image_reponame | regex_replace('^docker\.io/(library/)?', '') in node_images_stdout.split(',') %}false{%- else -%}true{%- endif -%}
  when:
    - not download_always_pull
    - node_images.stdout is defined

- name: Check_pull_required | Check that the local digest sha256 corresponds to the given image tag
  assert:
    that: "{{ download.repo }}:{{ download.tag }} in node_images_stdout.split(',') }}"
  when:
    - not download_always_pull
    - not pull_required
    - pull_by_digest
    - node_images.stdout is defined
  tags:
    - asserts

I am also running an upgrade cluster to check the logic with a "filled" container runtime. So far I have only tried it with cluster.yml (i.e. always fresh clusters) and the list is of course empty, let's see if it also fills successfully when the container runtime has already been used.

I don't see why it should behave differently, but better tested than wrongly assumed.

@VannTen
Copy link
Contributor

VannTen commented Apr 22, 2024 via email

@Payback159
Copy link
Contributor Author

If that's the case I don't quite understand how Ansible handles variable registration. Since the first run can access node_images.stdout but the 2nd breaks with the fact that node_images has no stdout.

I can still try if I can generally access node_images as dict_object and not node_images.stdout. So I wouldn't have to check the node_images in download_container.yml and pass them as variables to the include_tasks of check_pull_required.yml.

I tried to print node_images.stdout because the original tasks always went to stdout and I wanted to make as few changes as possible.

But if it makes more sense to leave the variable handling as close as possible to the tasks, I can still try to access node_images instead of node_images.stdout. But for this I have to check whether those check still fit.

- name: Check_pull_required | Set pull_required if the desired image is not yet loaded
  set_fact:
    pull_required: >-
      {%- if image_reponame | regex_replace('^docker\.io/(library/)?', '') in node_images_stdout.split(',') %}false{%- else -%}true{%- endif -%}
  when:
    - not download_always_pull
    - node_images.stdout is defined

- name: Check_pull_required | Check that the local digest sha256 corresponds to the given image tag
  assert:
    that: "{{ download.repo }}:{{ download.tag }} in node_images_stdout.split(',') }}"
  when:
    - not download_always_pull
    - not pull_required
    - pull_by_digest
    - node_images.stdout is defined
  tags:
    - asserts

@VannTen
Copy link
Contributor

VannTen commented Apr 22, 2024 via email

@Payback159
Copy link
Contributor Author

Payback159 commented Apr 22, 2024

I have now created a new branch and only added the debug step to output node_images and get the following result in the 2nd run

check_pull_required.yml

- name: Check_pull_required |  Generate a list of information about the images on a node # noqa command-instead-of-shell - image_info_command contains a pipe, therefore requiring shell
  shell: "{{ image_info_command }}"
  register: node_images
  changed_when: false
  check_mode: no
  when:
    - not download_always_pull
    - node_images is undefined

- name: Check_pull_required | Print the list of images on the node
  debug:
    var: node_images
  when:
    - not download_always_pull

Output:

TASK [download : Check_pull_required | Print the list of images on the node] ***
task path: /kubespray/roles/download/tasks/check_pull_required.yml:13
ok: [kubespray-2-24-1-checkpullrequired-2-2bb8cb34e-k8s-master-nf-1] => {
    "node_images": {
        "changed": false,
        "false_condition": "node_images is undefined",
        "skip_reason": "Conditional result was False",
        "skipped": true
    }
}

It can still process the first image because "Check_pull_required | Set pull_required if the desired image is not yet loaded" is executed. With the 2nd image, the task is no longer executed and access to node_images fails.

I think that because of the include_tasks expression, the variable only exists while check_pull_required.yml is running and as soon as Ansible switches back to download_container.yml, the variable is cleaned up.

OK, my last sentences don't make sense, because if the variable node_images is undefined, the function "Check_pull_required | Set pull_required if the desired image is not yet loaded" would be executed also in the 2nd run.

@Payback159
Copy link
Contributor Author

OK, I think I've understood you now. I thought the register is only executed if the task is also executed, but register is always executed by Ansible, regardless of whether the when condition applies or not and that explains the behavior.

But then I only have the option of generating the image list in download_container.yml or alternatively using a facts variable, right?

@VannTen
Copy link
Contributor

VannTen commented Apr 22, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

No branches or pull requests

2 participants