Upgrade prometheus stack to latest LTS versions #943
Conversation
Do we want to store checksums, or are we happy to download them alongside the binaries from GitHub?
Shall I remove our cloudalchemy.node_exporter, cloudalchemy.prometheus and alertmanager roles in this PR?
Is all the actual config backward-compatible, across all three?
… and that doesn't exist yet.
@ 748f820, confirmed monitoring.yml does complete! TODO: should check the prometheus propagate-binaries approach. I think this seemed bad before for large clusters with the old cloudalchemy roles, and we sort of patched it out/disabled it? But I don't remember the details TBH, maybe something about multiple downloads. Hopefully whatever we're doing to just do the install during image build should make this OK; I would just like to check.
It is backward-compatible for prometheus and node_exporter; alertmanager diverged more. See the common/alertmanager.yml config.
Please compare our ansible-prometheus fork with prometheus-community. They seem pretty much identical to me: download, checksum and unpack on the Ansible host, then copy to all hosts. True, we wouldn't care if it was downloaded on each host, as install.yml is only run on the builder to produce the image.
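For context, the pattern both collections use looks roughly like this; a minimal sketch, with illustrative paths and variable names rather than the collections' actual ones:

```yaml
# Download and unpack once on the Ansible host, then push to all hosts.
- name: Download prometheus archive to the Ansible host, verifying checksum
  ansible.builtin.get_url:
    url: "https://github.com/prometheus/prometheus/releases/download/v{{ prometheus_version }}/prometheus-{{ prometheus_version }}.linux-amd64.tar.gz"
    dest: "/tmp/prometheus-{{ prometheus_version }}.tar.gz"
    checksum: "sha256:{{ prometheus_checksum }}"
  delegate_to: localhost
  run_once: true

- name: Unpack on the Ansible host
  ansible.builtin.unarchive:
    src: "/tmp/prometheus-{{ prometheus_version }}.tar.gz"
    dest: /tmp/
  delegate_to: localhost
  run_once: true

- name: Copy the binary to all hosts
  ansible.builtin.copy:
    src: "/tmp/prometheus-{{ prometheus_version }}.linux-amd64/prometheus"
    dest: /usr/local/bin/prometheus
    mode: "0755"
```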
sjpb left a comment
I think the initial PR comment should also note the following "incidental" changes:
- Updates all OS packages
- Fixes cleanup of /tmp in ansible/cleanup.yml
- Removes the cockpit-system and cockpit-bridge packages
- Alertmanager db is now persisted in the appliances state directory (so alert state, e.g. silences, will be persisted across control node rebuilds)
- Adds an ignore list for the grype vulnerability scanner
sjpb left a comment
As per separate discussion, I feel the Python interpreter logic is really hard to reason about, and e.g. means pre.yml runs under different interpreters depending on how it is called, which seems like it's going to cause difficult problems at some point.
Decided to just run mysql under 3.9, which is needed to avoid a vuln in that package.
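In practice that pinning can be as small as a per-play variable; a minimal sketch, assuming a hypothetical hosts pattern and role name:

```yaml
# Pin the newer interpreter only for the play that needs it; everything
# else keeps the default interpreter (platform-python on RL8).
- hosts: mysql
  vars:
    ansible_python_interpreter: /usr/bin/python3.9
  tasks:
    - ansible.builtin.import_role:
        name: mysql
```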
CI failure for RL8: … But on that control node: … and Grafana is showing slurm jobs 1, 2, 3. So no idea what's happened there really, TBH. [edit] OK, after updating ansible/ci/retrieve_inventory.yml and running against the CI cluster, I was able to run …
After a version using the python3.9 interpreter as much as possible, to avoid the Python 3.6 platform-python on RL8, we decided to only use it where platform-python was causing an issue: installing an old pymysql in the mysql role.
There was no expansion in the command: module, so this needs to be shell: instead.
… to install a recent version of pymysql. This is less intrusive than trying to use the python3.9 interpreter as much as possible (we need to use platform-python anyway for the firewalld and SELinux Python bindings, which is Python 3.6 on RL8).
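To illustrate the command:/shell: distinction in that fix; a sketch with an illustrative version pin, not the exact task from the PR:

```yaml
# ansible.builtin.command runs the program directly: shell features such as
# $VAR expansion, globs and pipes are passed through literally. Anything
# relying on them must use ansible.builtin.shell, which runs via /bin/sh.
- name: Install a recent pymysql under python3.9
  ansible.builtin.shell: /usr/bin/python3.9 -m pip install --upgrade 'pymysql>=1.0'
```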
… in our ansible/roles/firewalld, called from bootstrap.yml. This was discovered when trying to use python3.9 as the Ansible interpreter as much as possible on RL8.
Ignoring at the moment, since projects have not had time to rebuild yet. There is a very high level of false positives from general tools (they just check the stdlib version) and no known exploitation. See https://access.redhat.com/security/cve/cve-2026-27143 for status.
We don't run rclone with remote control.
… as a precaution against the many vulnerabilities reported by grype.
The rebuilt 0.22 version is now free of those.
Fix issues found via grype; switch to (our fork of) the prometheus-community ansible collection.
Replaces the rclone binary with a no-op script to avoid vulnerabilities. Note the grype allowlist is at .grype.yaml and includes some vulnerabilities which have no fix at this time.
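For reference, a grype allowlist entry looks roughly like this; a sketch only, the actual .grype.yaml entries will differ:

```yaml
# .grype.yaml: findings matching these rules are excluded from scan output.
ignore:
  # High false-positive rate from stdlib-version matching, no known
  # exploitation; see https://access.redhat.com/security/cve/cve-2026-27143
  - vulnerability: CVE-2026-27143
```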