Upgrade prometheus stack to latest LTS versions #943

Merged: sjpb merged 59 commits into main from feat/upgrade-prometheus, May 6, 2026

Conversation

@elelaysh elelaysh commented Apr 8, 2026

Fixes issues found via grype; switches to (our fork of) the prometheus-community ansible collection.

  • Updates all OS packages
  • Fixes the cleanup of /tmp in ansible/cleanup.yml
  • Removes the cockpit-system and cockpit-bridge packages
  • Replaces the rclone binary with a no-op script to avoid vulnerabilities
  • Persists the alertmanager db in the appliances state directory (so alert state, e.g. silences, will be persisted across control node rebuilds)
  • Adds an ignore list for the grype vulnerability scanner, see .grype.yaml
  • Upgrades to prometheus 3.5.3, alertmanager 0.32.1, node_exporter 1.11.1
  • Pins workflows for compliance with the organisation allowlist

Note the grype allowlist is at .grype.yaml and includes some vulnerabilities which have no fix at this time.
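
For reference, grype reads ignore rules from a .grype.yaml at the repository root; a minimal sketch of the format is below (the CVE identifiers and package name are placeholders, not the actual entries in this PR's allowlist):

    ignore:
      # No fixed package is available yet; re-check on a future image build
      - vulnerability: CVE-2026-00001
      # Only ignore this CVE for one specific RPM, not globally
      - vulnerability: CVE-2026-00002
        package:
          name: some-package
          type: rpm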

@elelaysh elelaysh requested a review from a team as a code owner April 8, 2026 09:21

elelaysh commented Apr 8, 2026

Do we want to store checksums, or are we happy to download them alongside the binaries from GitHub?


elelaysh commented Apr 8, 2026

Shall I remove our cloudalchemy.node_exporter, cloudalchemy.prometheus and alertmanager roles in this PR?


sjpb commented Apr 8, 2026

> Do we want to store checksums, or are we happy to download them alongside the binaries from GitHub?

I think we should store checksums in ansible variables, along with versions.

> Shall I remove our cloudalchemy.node_exporter, cloudalchemy.prometheus and alertmanager roles in this PR?

Yes I think so - for my reference, the first two are just requirements.yml entries and alertmanager is an actual role.

Is all the actual config backward-compatible, across all three repos' roles?
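
A minimal sketch of storing versions and checksums together as ansible variables, with the checksum then enforced at download time - the variable names here are illustrative, not necessarily those the appliance or the collection actually use:

    # group_vars sketch - variable names are illustrative
    prometheus_version: "3.5.3"
    # sha256 of prometheus-{{ prometheus_version }}.linux-amd64.tar.gz,
    # taken from the sha256sums.txt published with the GitHub release
    prometheus_checksum: "sha256:<hash from the release's sha256sums.txt>"

    - name: Download prometheus tarball, verifying the pinned checksum
      ansible.builtin.get_url:
        url: "https://github.com/prometheus/prometheus/releases/download/v{{ prometheus_version }}/prometheus-{{ prometheus_version }}.linux-amd64.tar.gz"
        dest: "/tmp/prometheus-{{ prometheus_version }}.linux-amd64.tar.gz"
        checksum: "{{ prometheus_checksum }}"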

Comment thread ansible/fatimage.yml Outdated
Comment thread requirements.yml Outdated
Comment thread packer/openstack.pkr.hcl Outdated
Comment thread .grype.yaml Outdated
@elelaysh elelaysh force-pushed the feat/upgrade-prometheus branch 4 times, most recently from 9514cba to 7349d4d on April 9, 2026 09:58

elelaysh commented Apr 9, 2026

@elelaysh elelaysh force-pushed the feat/upgrade-prometheus branch 2 times, most recently from 9b980e6 to 0d874c7 on April 10, 2026 08:57
@elelaysh elelaysh force-pushed the feat/upgrade-prometheus branch from 0690c0b to f17e89b on April 10, 2026 10:53

sjpb commented Apr 10, 2026

  • Deployed cluster from main @ 602a830
  • Ran hpctests:pingpong
  • Updated to ade3802, ran setup-env
  • Reimaged cluster to new image: ansible-playbook ansible/adhoc/rebuild.yml -e rebuild_image=openhpc-RL9-260410-0859-0d874c74
  • Ran ansible compute -ba "touch /var/lib/ansible-init.done" because the .stackhpc environment is configured for compute-init and I'd accidentally already started site.
  • Ran site.yml - FAILS because it is trying to do:

        - name: Configure alertmanager
          # TODO: compare with prometheus.prometheus.alertmanager
          ansible.builtin.include_role:
            name: alertmanager
            tasks_from: configure.yml

    and that doesn't exist yet

  • Confirmed pre-rebuild pingpong job shows in slurm job dashboard
  • Confirmed network and CPU data appears when clicking through to the job
  • Ran hpctests: pingmatrix
  • Confirmed job shows in slurm jobs dashboard
  • Confirmed network and CPU data appears when clicking through to the job


sjpb commented Apr 10, 2026

@ 748f820, confirmed monitoring.yml does complete!

TODO: should check the prometheus binary-propagation approach - I think this seemed bad for large clusters with the old cloudalchemy roles, and we sort of patched it out/disabled it? I don't remember the details TBH, maybe something about multiple downloads. But hopefully, since whatever we're doing now just does the install during image build, this should be OK - I'd just like to check.

@elelaysh elelaysh force-pushed the feat/upgrade-prometheus branch 3 times, most recently from b74345f to 5df2cd4 on April 14, 2026 08:57
@elelaysh

> Is all the actual config backward-compatible, across all three repos' roles?

It is backward-compatible for prometheus and node_exporter; alertmanager diverged more. See the common/alertmanager.yml config.

@elelaysh

> TODO: should check the prometheus binary-propagation approach - I think this seemed bad for large clusters with the old cloudalchemy roles, and we sort of patched it out/disabled it? I don't remember the details TBH, maybe something about multiple downloads. But hopefully, since whatever we're doing now just does the install during image build, this should be OK - I'd just like to check.

Please see our ansible-prometheus vs prometheus-community. They seem pretty much identical to me: download, checksum and unpack on the ansible host, then copy to all hosts. True, we wouldn't care even if it were downloaded on each host, since install.yml is only run on the builder to produce the image.
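
For reference, the pattern both collections follow is roughly the sketch below (simplified, with illustrative task and variable names rather than the roles' actual ones):

    - name: Download and unpack node_exporter once, on the controller
      run_once: true
      delegate_to: localhost
      ansible.builtin.unarchive:
        src: "https://github.com/prometheus/node_exporter/releases/download/v{{ node_exporter_version }}/node_exporter-{{ node_exporter_version }}.linux-amd64.tar.gz"
        dest: /tmp
        remote_src: true

    - name: Copy the unpacked binary out to every target host
      ansible.builtin.copy:
        src: "/tmp/node_exporter-{{ node_exporter_version }}.linux-amd64/node_exporter"
        dest: /usr/local/bin/node_exporter
        mode: "0755"
      become: true

Since install.yml only runs on the Packer builder, the copy fan-out in the second task never grows with cluster size, so the old large-cluster concern should not apply here.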

@sjpb sjpb left a comment

I think the initial PR comment should also note the following "incidental" changes:

  • Updates all OS packages
  • Fixes the cleanup of /tmp in ansible/cleanup.yml
  • Removes the cockpit-system and cockpit-bridge packages
  • Persists the alertmanager db in the appliances state directory (so alert state, e.g. silences, will be persisted across control node rebuilds)
  • Adds an ignore list for the grype vulnerability scanner
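
On the alertmanager persistence point, a hedged sketch of the kind of override involved - the variable names (alertmanager_db_dir from the role, appliances_state_dir from the appliance) are assumptions here, not confirmed from this PR:

    # group_vars sketch - variable names are assumptions
    # Keep alertmanager's notification log and silences on the appliance
    # state volume so they survive a control node rebuild.
    alertmanager_db_dir: "{{ appliances_state_dir }}/alertmanager"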

Comment thread .github/workflows/extra.yml Outdated
Comment thread ansible/fatimage.yml Outdated
Comment thread ansible/bootstrap.yml Outdated
Comment thread ansible/bootstrap.yml Outdated
Comment thread docs/alerting.md
Comment thread packer/openstack.pkr.hcl Outdated
Comment thread packer/openstack.pkr.hcl Outdated
Comment thread packer/openstack.pkr.hcl Outdated
@sjpb sjpb left a comment


As per separate discussion, I feel the Python interpreter logic is really hard to reason about, and e.g. means pre.yml runs under different interpreters depending on how it is called, which seems like it's going to cause difficult problems at some point.

Decided to just run the mysql role under Python 3.9, which is needed to avoid a vuln in that package.
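
A minimal sketch of scoping the interpreter to just that role (the task and role placement here are illustrative, not necessarily how the appliance actually wires this up):

    - name: Set up the slurm database
      ansible.builtin.include_role:
        name: mysql
      vars:
        # Use python3.9 for this role only, so a recent pymysql can be
        # installed; everything else keeps RL8's platform-python (3.6).
        ansible_python_interpreter: /usr/bin/python3.9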

Comment thread .github/workflows/lint.yml
Comment thread docs/alerting.md
Comment thread packer/openstack.pkr.hcl Outdated
Comment thread packer/openstack.pkr.hcl Outdated
Comment thread packer/openstack.pkr.hcl Outdated

sjpb commented Apr 30, 2026

CI failure for RL8:

TASK [Query grafana for expected hpctests jobs] ********************************
task path: /home/runner/work/ansible-slurm-appliance/ansible-slurm-appliance/ansible/ci/check_grafana.yml:14
Thursday 30 April 2026  09:39:32 +0000 (0:00:00.738)       0:00:00.782 ******** 
fatal: [slurmci-RL8-1030-control]: FAILED! => {}

MSG:

The conditional check '_expected_jobs | difference(_found_jobs) == []' failed. The error was: error while evaluating conditional (_expected_jobs | difference(_found_jobs) == []): {{ _slurm_stats_jobs.docs | map(attribute='JobName', default='(json error in slurmstats data)') }}: 'dict object' has no attribute 'docs'. 'dict object' has no attribute 'docs'. {{ _slurm_stats_jobs.docs | map(attribute='JobName', default='(json error in slurmstats data)') }}: 'dict object' has no attribute 'docs'. 'dict object' has no attribute 'docs'

But on that control node:

[root@slurmci-RL8-1030-control rocky]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
extra        up 60-00:00:0      0    n/a 
standard*    up 60-00:00:0      2   idle slurmci-RL8-1030-compute-[0-1]
rebuild      up      30:00      2   idle slurmci-RL8-1030-compute-[0-1]
[root@slurmci-RL8-1030-control rocky]# sacct
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
1            pingpong.+   standard                     2  COMPLETED      0:0 
1.batch           batch                                1  COMPLETED      0:0 
1.0               orted                                2  COMPLETED      0:0 
2            rebuild-s+    rebuild       root          1  COMPLETED      0:0 
2.batch           batch                  root          1  COMPLETED      0:0 
3            rebuild-s+    rebuild       root          1  COMPLETED      0:0 
3.batch           batch                  root          1  COMPLETED      0:0 

and Grafana is showing slurm jobs 1, 2, 3.

So no idea what's happened there really TBH.

[edit]

OK - after updating ansible/ci/retrieve_inventory.yml and running against the CI cluster, I was able to run ansible-playbook ansible/ci/check_grafana.yml with no errors. So, a transient network issue??
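
If it really was transient, one low-effort guard would be wrapping the query in retries - a sketch only, with hypothetical URL and variable names, not what check_grafana.yml actually does:

    - name: Query grafana for expected hpctests jobs
      ansible.builtin.uri:
        url: "{{ grafana_slurm_stats_url }}"   # hypothetical variable
        return_content: true
      register: _slurm_stats_jobs
      # Retry until the response contains the expected docs list, so a
      # single transient failure does not fail the whole CI run.
      until: "'docs' in (_slurm_stats_jobs.json | default({}))"
      retries: 5
      delay: 10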

@elelaysh elelaysh force-pushed the feat/upgrade-prometheus branch from 25b66ca to ab3405d on May 4, 2026 08:34

elelaysh commented May 4, 2026

After a version that used the python3.9 interpreter as much as possible, to avoid the Python 3.6 platform-python on RL8, we decided to only use it where platform-python was actually causing an issue: it would install an old pymysql in the mysql role.

@elelaysh elelaysh force-pushed the feat/upgrade-prometheus branch 3 times, most recently from eb9a19d to 9fa37e9 on May 5, 2026 10:02
elelaysh added 11 commits May 5, 2026 12:48

  • there was no expansion in the command: needs to be a shell
  • to install a recent version of pymysql. This is less intrusive than trying to use the python3.9 interpreter as much as possible (platform-python is needed anyway for the firewalld and selinux python bindings, which is python 3.6 on RL8)
  • in our ansible/roles/firewalld, called from bootstrap.yml. This was discovered when trying to use python3.9 as the ansible interpreter as much as possible on RL8
  • Ignoring at the moment since projects have not had time to rebuild yet. There is a very high level of false positives on general tools (which just check the stdlib version) and no known exploitation. See https://access.redhat.com/security/cve/cve-2026-27143 for status
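
On the first note above (command vs shell): ansible.builtin.command runs the program directly, so shell features such as globbing and variable expansion are not applied, while ansible.builtin.shell passes the line through /bin/sh. A minimal illustration - not the actual task changed in this PR:

    - name: Does NOT expand the glob - command passes '*.tmp' to rm literally
      ansible.builtin.command: rm -f /tmp/build/*.tmp

    - name: DOES expand the glob, because the line runs via /bin/sh
      ansible.builtin.shell: rm -f /tmp/build/*.tmp
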
@elelaysh elelaysh force-pushed the feat/upgrade-prometheus branch from 9fa37e9 to 44572c5 on May 5, 2026 10:49
as a precaution against the many vulnerabilities reported by grype

elelaysh commented May 5, 2026

Comment thread .grype.yaml
Rebuilt 0.22 version is now free of those
Comment thread .grype.yaml Outdated
elelaysh commented May 6, 2026

@sjpb sjpb self-requested a review May 6, 2026 13:17
@sjpb sjpb enabled auto-merge (squash) May 6, 2026 13:18
@sjpb sjpb merged commit c49855f into main May 6, 2026
34 of 38 checks passed
@sjpb sjpb deleted the feat/upgrade-prometheus branch May 6, 2026 13:36