Upgrade prometheus stack to latest LTS versions #943

Merged: sjpb merged 59 commits into main from feat/upgrade-prometheus, May 6, 2026

Conversation

@elelaysh elelaysh commented Apr 8, 2026

Fixes issues found via grype; switches to (our fork of) the prometheus-community ansible collection.

  • Updates all OS packages
  • Fixes the cleanup of /tmp in ansible/cleanup.yml
  • Removes the cockpit-system and cockpit-bridge packages
  • Replaces the rclone binary with a no-op script to avoid vulnerabilities
  • Persists the alertmanager db in the appliances state directory (so alert state, e.g. silences, will be persisted across control node rebuilds)
  • Adds an ignore list for the grype vulnerability scanner, see .grype.yaml
  • Upgrades to prometheus 3.5.3, alertmanager 0.32.1, node_exporter 1.11.1
  • Pins workflows for compliance with the organisation allowlist

Note the grype allowlist is at .grype.yaml and includes some vulnerabilities which have no fix at this time.
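
For reference, grype reads ignore rules from a .grype.yaml at the repository root; a minimal sketch of the format is below (the CVE identifiers and package name are placeholders, not the actual entries in this PR's allowlist):

    ignore:
      # No fixed package is available yet; re-check on a future image build
      - vulnerability: CVE-2026-00001
      # Only ignore this CVE for one specific RPM, not globally
      - vulnerability: CVE-2026-00002
        package:
          name: some-package
          type: rpm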

@elelaysh elelaysh requested a review from a team as a code owner April 8, 2026 09:21

elelaysh commented Apr 8, 2026

Do we want to store checksums, or are we happy to download them alongside the binaries from GitHub?


elelaysh commented Apr 8, 2026

Shall I remove our cloudalchemy.node_exporter, cloudalchemy.prometheus and alertmanager roles in this PR?


sjpb commented Apr 8, 2026

> Do we want to store checksums, or are we happy to download them alongside the binaries from GitHub?

I think we should store checksums in ansible variables, along with versions.

> Shall I remove our cloudalchemy.node_exporter, cloudalchemy.prometheus and alertmanager roles in this PR?

Yes I think so - for my reference, the first two are just requirements.yml entries and alertmanager is an actual role.

Is all the actual config backward-compatible, across all three repos' roles?
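
A minimal sketch of storing versions and checksums together as ansible variables, with the checksum then enforced at download time - the variable names here are illustrative, not necessarily those the appliance or the collection actually use:

    # group_vars sketch - variable names are illustrative
    prometheus_version: "3.5.3"
    # sha256 of prometheus-{{ prometheus_version }}.linux-amd64.tar.gz,
    # taken from the sha256sums.txt published with the GitHub release
    prometheus_checksum: "sha256:<hash from the release's sha256sums.txt>"

    - name: Download prometheus tarball, verifying the pinned checksum
      ansible.builtin.get_url:
        url: "https://github.com/prometheus/prometheus/releases/download/v{{ prometheus_version }}/prometheus-{{ prometheus_version }}.linux-amd64.tar.gz"
        dest: "/tmp/prometheus-{{ prometheus_version }}.linux-amd64.tar.gz"
        checksum: "{{ prometheus_checksum }}"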

Comment thread ansible/fatimage.yml Outdated
Comment thread requirements.yml Outdated
Comment thread packer/openstack.pkr.hcl Outdated
Comment thread .grype.yaml Outdated
@elelaysh elelaysh force-pushed the feat/upgrade-prometheus branch 4 times, most recently from 9514cba to 7349d4d on April 9, 2026 09:58

elelaysh commented Apr 9, 2026

@elelaysh elelaysh force-pushed the feat/upgrade-prometheus branch 2 times, most recently from 9b980e6 to 0d874c7 on April 10, 2026 08:57
@elelaysh elelaysh force-pushed the feat/upgrade-prometheus branch from 0690c0b to f17e89b on April 10, 2026 10:53

sjpb commented Apr 10, 2026

  • Deployed cluster from main @ 602a830
  • Ran hpctests:pingpong
  • Updated to ade3802, ran setup-env
  • Reimaged cluster to new image: ansible-playbook ansible/adhoc/rebuild.yml -e rebuild_image=openhpc-RL9-260410-0859-0d874c74
  • Ran ansible compute -ba "touch /var/lib/ansible-init.done" because the .stackhpc environment is configured for compute-init and I'd accidentally already started site.
  • Ran site.yml - FAILS because it is trying to do:

        - name: Configure alertmanager
          # TODO: compare with prometheus.prometheus.alertmanager
          ansible.builtin.include_role:
            name: alertmanager
            tasks_from: configure.yml

    and that doesn't exist yet

  • Confirmed pre-rebuild pingpong job shows in slurm job dashboard
  • Confirmed network and CPU data appears when clicking through to the job
  • Ran hpctests: pingmatrix
  • Confirmed job shows in slurm jobs dashboard
  • Confirmed network and CPU data appears when clicking through to the job


sjpb commented Apr 10, 2026

@ 748f820, confirmed monitoring.yml does complete!

TODO: should check the prometheus binary-propagation approach - I think this seemed bad for large clusters with the old cloudalchemy roles, and we sort of patched it out/disabled it? I don't remember the details TBH, maybe something about multiple downloads. But hopefully, since whatever we're doing now just does the install during image build, this should be OK - I'd just like to check.

@elelaysh elelaysh force-pushed the feat/upgrade-prometheus branch 3 times, most recently from b74345f to 5df2cd4 on April 14, 2026 08:57
@elelaysh

> Is all the actual config backward-compatible, across all three repos' roles?

It is backward-compatible for prometheus and node_exporter; alertmanager diverged more. See the common/alertmanager.yml config.

@elelaysh

> TODO: should check the prometheus binary-propagation approach - I think this seemed bad for large clusters with the old cloudalchemy roles, and we sort of patched it out/disabled it? I don't remember the details TBH, maybe something about multiple downloads. But hopefully, since whatever we're doing now just does the install during image build, this should be OK - I'd just like to check.

Please see our ansible-prometheus vs prometheus-community. They seem pretty much identical to me: download, checksum and unpack on the ansible host, then copy to all hosts. True, we wouldn't care even if it were downloaded on each host, since install.yml is only run on the builder to produce the image.
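
For reference, the pattern both collections follow is roughly the sketch below (simplified, with illustrative task and variable names rather than the roles' actual ones):

    - name: Download and unpack node_exporter once, on the controller
      run_once: true
      delegate_to: localhost
      ansible.builtin.unarchive:
        src: "https://github.com/prometheus/node_exporter/releases/download/v{{ node_exporter_version }}/node_exporter-{{ node_exporter_version }}.linux-amd64.tar.gz"
        dest: /tmp
        remote_src: true

    - name: Copy the unpacked binary out to every target host
      ansible.builtin.copy:
        src: "/tmp/node_exporter-{{ node_exporter_version }}.linux-amd64/node_exporter"
        dest: /usr/local/bin/node_exporter
        mode: "0755"
      become: true

Since install.yml only runs on the Packer builder, the copy fan-out in the second task never grows with cluster size, so the old large-cluster concern should not apply here.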

@sjpb sjpb left a comment

I think the initial PR comment should also note the following "incidental" changes:

  • Updates all OS packages
  • Fixes the cleanup of /tmp in ansible/cleanup.yml
  • Removes the cockpit-system and cockpit-bridge packages
  • Persists the alertmanager db in the appliances state directory (so alert state, e.g. silences, will be persisted across control node rebuilds)
  • Adds an ignore list for the grype vulnerability scanner
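
On the alertmanager persistence point, a hedged sketch of the kind of override involved - the variable names (alertmanager_db_dir from the role, appliances_state_dir from the appliance) are assumptions here, not confirmed from this PR:

    # group_vars sketch - variable names are assumptions
    # Keep alertmanager's notification log and silences on the appliance
    # state volume so they survive a control node rebuild.
    alertmanager_db_dir: "{{ appliances_state_dir }}/alertmanager"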

Comment thread .github/workflows/extra.yml Outdated
Comment thread ansible/fatimage.yml Outdated
Comment thread ansible/bootstrap.yml Outdated
Comment thread ansible/bootstrap.yml Outdated
Comment thread docs/alerting.md
Comment thread packer/openstack.pkr.hcl Outdated
Comment thread packer/openstack.pkr.hcl Outdated
Comment thread packer/openstack.pkr.hcl Outdated
@sjpb sjpb left a comment


As per separate discussion, I feel the Python interpreter logic is really hard to reason about, and e.g. means pre.yml runs under different interpreters depending on how it is called, which seems like it's going to cause difficult problems at some point.

Decided to just run the mysql role under Python 3.9, which is needed to avoid a vuln in that package.
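
A minimal sketch of scoping the interpreter to just that role (the task and role placement here are illustrative, not necessarily how the appliance actually wires this up):

    - name: Set up the slurm database
      ansible.builtin.include_role:
        name: mysql
      vars:
        # Use python3.9 for this role only, so a recent pymysql can be
        # installed; everything else keeps RL8's platform-python (3.6).
        ansible_python_interpreter: /usr/bin/python3.9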

Comment thread .github/workflows/lint.yml
Comment thread docs/alerting.md
Comment thread packer/openstack.pkr.hcl Outdated
Comment thread packer/openstack.pkr.hcl Outdated
Comment thread packer/openstack.pkr.hcl Outdated

sjpb commented Apr 30, 2026

CI failure for RL8:

TASK [Query grafana for expected hpctests jobs] ********************************
task path: /home/runner/work/ansible-slurm-appliance/ansible-slurm-appliance/ansible/ci/check_grafana.yml:14
Thursday 30 April 2026  09:39:32 +0000 (0:00:00.738)       0:00:00.782 ******** 
fatal: [slurmci-RL8-1030-control]: FAILED! => {}

MSG:

The conditional check '_expected_jobs | difference(_found_jobs) == []' failed. The error was: error while evaluating conditional (_expected_jobs | difference(_found_jobs) == []): {{ _slurm_stats_jobs.docs | map(attribute='JobName', default='(json error in slurmstats data)') }}: 'dict object' has no attribute 'docs'. 'dict object' has no attribute 'docs'. {{ _slurm_stats_jobs.docs | map(attribute='JobName', default='(json error in slurmstats data)') }}: 'dict object' has no attribute 'docs'. 'dict object' has no attribute 'docs'

But on that control node:

[root@slurmci-RL8-1030-control rocky]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
extra        up 60-00:00:0      0    n/a 
standard*    up 60-00:00:0      2   idle slurmci-RL8-1030-compute-[0-1]
rebuild      up      30:00      2   idle slurmci-RL8-1030-compute-[0-1]
[root@slurmci-RL8-1030-control rocky]# sacct
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
1            pingpong.+   standard                     2  COMPLETED      0:0 
1.batch           batch                                1  COMPLETED      0:0 
1.0               orted                                2  COMPLETED      0:0 
2            rebuild-s+    rebuild       root          1  COMPLETED      0:0 
2.batch           batch                  root          1  COMPLETED      0:0 
3            rebuild-s+    rebuild       root          1  COMPLETED      0:0 
3.batch           batch                  root          1  COMPLETED      0:0 

and Grafana is showing slurm jobs 1, 2, 3.

So no idea what's happened there really TBH.

[edit]

OK - after updating ansible/ci/retrieve_inventory.yml and running against the CI cluster, I was able to run ansible-playbook ansible/ci/check_grafana.yml with no errors. So, a transient network issue??
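
If it really was transient, one low-effort guard would be wrapping the query in retries - a sketch only, with hypothetical URL and variable names, not what check_grafana.yml actually does:

    - name: Query grafana for expected hpctests jobs
      ansible.builtin.uri:
        url: "{{ grafana_slurm_stats_url }}"   # hypothetical variable
        return_content: true
      register: _slurm_stats_jobs
      # Retry until the response contains the expected docs list, so a
      # single transient failure does not fail the whole CI run.
      until: "'docs' in (_slurm_stats_jobs.json | default({}))"
      retries: 5
      delay: 10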

@elelaysh elelaysh force-pushed the feat/upgrade-prometheus branch from 25b66ca to ab3405d on May 4, 2026 08:34

elelaysh commented May 4, 2026

After a version that used the python3.9 interpreter as much as possible, to avoid the Python 3.6 platform-python on RL8, we decided to only use it where platform-python was actually causing an issue: it would install an old pymysql in the mysql role.

@elelaysh elelaysh force-pushed the feat/upgrade-prometheus branch 3 times, most recently from eb9a19d to 9fa37e9 on May 5, 2026 10:02
elelaysh added 11 commits May 5, 2026 12:48

  • there was no expansion in the command: needs to be a shell
  • to install a recent version of pymysql. This is less intrusive than trying to use the python3.9 interpreter as much as possible (platform-python is needed anyway for the firewalld and selinux python bindings, which is python 3.6 on RL8)
  • in our ansible/roles/firewalld, called from bootstrap.yml. This was discovered when trying to use python3.9 as the ansible interpreter as much as possible on RL8
  • Ignoring at the moment since projects have not had time to rebuild yet. There is a very high level of false positives on general tools (which just check the stdlib version) and no known exploitation. See https://access.redhat.com/security/cve/cve-2026-27143 for status
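
On the first note above (command vs shell): ansible.builtin.command runs the program directly, so shell features such as globbing and variable expansion are not applied, while ansible.builtin.shell passes the line through /bin/sh. A minimal illustration - not the actual task changed in this PR:

    - name: Does NOT expand the glob - command passes '*.tmp' to rm literally
      ansible.builtin.command: rm -f /tmp/build/*.tmp

    - name: DOES expand the glob, because the line runs via /bin/sh
      ansible.builtin.shell: rm -f /tmp/build/*.tmp
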
@elelaysh elelaysh force-pushed the feat/upgrade-prometheus branch from 9fa37e9 to 44572c5 on May 5, 2026 10:49
as a precaution against the many vulnerabilities reported by grype

elelaysh commented May 5, 2026

Comment thread .grype.yaml
Rebuilt 0.22 version is now free of those
Comment thread .grype.yaml Outdated
elelaysh commented May 6, 2026

@sjpb sjpb self-requested a review May 6, 2026 13:17
@sjpb sjpb enabled auto-merge (squash) May 6, 2026 13:18
@sjpb sjpb merged commit c49855f into main May 6, 2026
34 of 38 checks passed
@sjpb sjpb deleted the feat/upgrade-prometheus branch May 6, 2026 13:36