Bump: 24.05.6 release #70

Merged
merged 16 commits into from
Feb 26, 2025

Conversation

itkovian
Member

No description provided.

tripiana and others added 16 commits January 29, 2025 20:30
Suspended jobs are not removed from node usage, so if you cancel one after
that, node usage still holds a pointer to a finished job.

This causes two issues:

1. It can prevent the job being evaluated from running.
2. If the deleted job is purged, any attempt to read its contents will lead
to bad data and a potential crash.

In the related ticket, _is_job_sharing was segfaulting.

Changelog: Fix crash and issues evaluating job's suitability for running in
 nodes with already suspended job(s) there.
Ticket: 21767
Cherry-picked: 19d9185
Cherry-pick !428 into slurm-24.05

See merge request SchedMD/dev/slurm!506
Cherry-pick !516 into slurm-24.05

See merge request SchedMD/dev/slurm!518
When a job taking 2 or more nodes had all of its nodes fail, and no
EpilogSlurmctld was configured, job requeuing was not correctly processed
as batch_requeue_fini was not called. This resulted in the following issues:

- Requeued job was not assigned a new SLUID.
- Job steps of new jobs were not being reset to 0.

This left incorrect entries in the accounting database for the requeued job.
Added a batch_requeue_fini call to fix that.

Ticket: 20177
Changelog: Fixed a job requeuing issue that merged job entries into the
 same SLUID when all nodes in a job failed simultaneously.
Cherry-picked: d7c0dfc
Cherry-pick !322 into slurm-24.05

See merge request SchedMD/dev/slurm!541
Newer cxi drivers changed the kernel module to "cxi_ss1". To maintain
support for both new and old drivers, try the new location first, then
fall back to the old one when checking rdzv_get_en_default.

Changelog: switch/hpe_slingshot - Fix compatibility with newer cxi
 drivers, specifically when specifying disable_rdzv_get.
Ticket: 22087
Cherry-picked: e8ed3df
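
The "new location first, then old" lookup described in this commit can be sketched roughly as below. This is an illustration only, not the Slurm code (which is C). The sysfs paths are assumptions: only the "cxi_ss1" module name and the rdzv_get_en_default parameter come from the commit message, while the old module name "cxi_core" and the helper name read_first_available are hypothetical.

```python
from pathlib import Path
from typing import Iterable, Optional

# Assumed sysfs layout. Only "cxi_ss1" and "rdzv_get_en_default" appear in
# the commit message; "cxi_core" as the old module name is an assumption.
CANDIDATE_PARAM_FILES = (
    Path("/sys/module/cxi_ss1/parameters/rdzv_get_en_default"),   # newer drivers
    Path("/sys/module/cxi_core/parameters/rdzv_get_en_default"),  # older drivers (assumed)
)


def read_first_available(paths: Iterable[Path]) -> Optional[str]:
    """Return the stripped contents of the first readable file, or None."""
    for path in paths:
        try:
            return path.read_text().strip()
        except OSError:
            # Missing or unreadable at this location: try the next candidate.
            continue
    return None


if __name__ == "__main__":
    value = read_first_available(CANDIDATE_PARAM_FILES)
    print("rdzv_get_en_default:", value if value is not None else "not found")
```

Ordering the candidates newest first means systems with the newer driver never touch the old path, while older systems fall through to it without any extra configuration.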
Cherry-pick !579 into slurm-24.05

See merge request SchedMD/dev/slurm!582
Trigger abort() rather than exit() for any fatal() message.

Changelog: Add ABORT_ON_FATAL environment variable to capture a backtrace
  from any fatal() message.
Issue: 50181
Ticket: 21582
Cherry-picked: 5666caa
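
As a rough illustration of the behaviour this commit describes, the sketch below switches a fatal-error handler from a normal exit to an abort when ABORT_ON_FATAL is present in the environment. It is not the Slurm implementation (which is C). The helper names should_abort and fatal, the exit code 1, and the choice to treat any set value as "enabled" are assumptions made for the example.

```python
import os
import sys
from typing import Mapping


def should_abort(env: Mapping[str, str]) -> bool:
    """Assumption: the variable merely being set enables abort-on-fatal."""
    return "ABORT_ON_FATAL" in env


def fatal(message: str) -> None:
    """Log a fatal message, then abort or exit depending on the environment."""
    print(f"fatal: {message}", file=sys.stderr)
    if should_abort(os.environ):
        # Raises SIGABRT, so a core dump (and thus a backtrace) can be
        # captured if core dumps are enabled on the system.
        os.abort()
    sys.exit(1)
```

Aborting instead of exiting keeps the failing call stack available to a debugger, which is the stated purpose of the variable in the changelog entry.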
Cherry-pick !575 into slurm-24.05

See merge request SchedMD/dev/slurm!586
Cherry-pick !615 into slurm-24.05

See merge request SchedMD/dev/slurm!629
Update slurm.spec and debian/changelog as well.
@wdpypere wdpypere merged commit 317c71a into hpcugent:24.05.ug Feb 26, 2025
1 check passed
7 participants