Slurm changes for omnia 2.0 #2479

Open: wants to merge 31 commits into base: pub/new_architecture

Commits (31):
896a6c2
Slurm in a single role
jagadeeshnv Feb 14, 2025
f01658e
Lint fixes
jagadeeshnv Feb 14, 2025
977f880
Clean up flag reset
jagadeeshnv Feb 14, 2025
80ff376
Remove unused
jagadeeshnv Feb 14, 2025
f9244a8
pytorch python version fix
jagadeeshnv Feb 18, 2025
4680b9e
[OMN01B-171]: Install slurmdbd with existing database
Cypher-Miller Feb 27, 2025
d8f04d9
[OMN01B-171]: Uncommented K8 code
Cypher-Miller Feb 27, 2025
05e8b27
[OMN01B-208]: Add slurm ansible code to create new DB if
Cypher-Miller Feb 27, 2025
3ba242b
[OMN01B-208]: Conditional added to only add db when db_host is provided
Cypher-Miller Feb 27, 2025
6136805
Add node complete
jagadeeshnv Feb 27, 2025
b414b25
Dbd code uncommented
jagadeeshnv Feb 27, 2025
b673280
Remove Node done
jagadeeshnv Feb 28, 2025
2e2a860
Add node fixed scontrol error
jagadeeshnv Feb 28, 2025
55e1646
Moved db.yml task entry point, added default behavior for db_port and…
Cypher-Miller Feb 28, 2025
66fad29
Merge branch 'pub/new_architecture' of github.com:jagadeeshnv/omnia i…
Cypher-Miller Feb 28, 2025
4ec9c79
Updated some slurm var descriptions
Cypher-Miller Feb 28, 2025
a2a0be7
Cleanup of _config_files.yml
jagadeeshnv Mar 2, 2025
6fd0a03
Cleanup of cleanll
jagadeeshnv Mar 2, 2025
9a28e22
Share dir creation synchronized
jagadeeshnv Mar 2, 2025
a2e6c8c
Debug statements cleaned
jagadeeshnv Mar 2, 2025
9e00ce9
Fixed Add db user tasks to successfully connect to mariadb db
Cypher-Miller Mar 3, 2025
c16d6d2
Merge branch 'pub/new_architecture' of github.com:jagadeeshnv/omnia i…
Cypher-Miller Mar 3, 2025
fe9355f
Added create new db user; Moved slurmdbd.conf creation code
Cypher-Miller Mar 3, 2025
9e70e82
Fixed typo causing error when creating slurmdbd.conf
Cypher-Miller Mar 3, 2025
9fb669a
benchmark tools openmpi command simplified
jagadeeshnv Mar 3, 2025
7d433fe
Changed db_ to slurm_db_; Made slurmdbd db user's privileges more ris…
Cypher-Miller Mar 3, 2025
358ecc2
Merge branch 'pub/new_architecture' of github.com:jagadeeshnv/omnia i…
Cypher-Miller Mar 3, 2025
bf73c31
Added support for db ports other than 3306
Cypher-Miller Mar 3, 2025
a30ec2c
Fixed issue where slurmctld service would sometimes not restart when …
Cypher-Miller Mar 3, 2025
e387f57
Reverted restart logic for slurmdbd and slurmctld
Cypher-Miller Mar 4, 2025
5376453
additional check for specific slurmd restarts
jagadeeshnv Mar 4, 2025
12 changes: 12 additions & 0 deletions input/config/rhel/9.4/openmpi.json
@@ -0,0 +1,12 @@
{
"openmpi": {
"cluster": [
{ "package": "openmpi",
"type": "tarball",
"url": "https://download.open-mpi.org/release/open-mpi/v{{ openmpi_version.split('.')[:2] | join('.') }}/openmpi-{{ openmpi_version }}.tar.gz"
},
{"package": "gcc-c++", "type": "rpm", "repo_name": "appstream"},
{"package": "clang", "type": "rpm", "repo_name": "appstream"}
]
}
}
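The url here is a Jinja2 template. As a reference for how it renders, assuming a hypothetical openmpi_version of 4.1.6 (the real value comes from Omnia's version inputs, not this file): openmpi_version.split('.')[:2] | join('.') evaluates to "4.1", so the package would be fetched from

https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.6.tar.gz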
32 changes: 32 additions & 0 deletions input/config/rhel/9.4/slurm.json
@@ -0,0 +1,32 @@
{
"slurm": {
"cluster": [
{"package": "munge", "type": "rpm", "repo_name": "appstream"},
{"package": "firewalld", "type": "rpm", "repo_name": "baseos"},
{"package": "python3-firewall", "type": "rpm", "repo_name": "baseos"}
]
},
"slurm_control_node": {
"cluster": [
{"package": "slurm-slurmctld", "type": "rpm", "repo_name": "epel"}
]
},
"slurm_node": {
"cluster": [
{"package": "slurm-slurmd", "type": "rpm", "repo_name": "epel"}
]
},
"slurmdbd":{
"cluster": [
{"package": "slurm-slurmdbd", "type": "rpm", "repo_name": "epel"},
{"package": "python3-PyMySQL", "type": "rpm", "repo_name": "appstream"},
{"package": "mysql-server", "type": "rpm", "repo_name": "appstream"},
{"package": "mariadb-server", "type": "rpm", "repo_name": "appstream"}
]
},
"login":{
"cluster": [
{"package": "slurm", "type": "rpm", "repo_name": "epel"}
]
}
}
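Most of the top-level keys here appear to correspond to the inventory groups used by the playbooks later in this PR (slurm_control_node, slurm_node, slurm_dbd, login), so each class of node only pulls the packages it needs. A minimal inventory sketch under that assumption, with placeholder hostnames:

all:
  children:
    slurm_control_node:
      hosts:
        controller01:
    slurm_dbd:
      hosts:
        controller01:
    slurm_node:
      hosts:
        compute01:
        compute02:
    login:
      hosts:
        login01: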
11 changes: 11 additions & 0 deletions input/config/rhel/9.4/ucx.json
@@ -0,0 +1,11 @@
{
"ucx": {
"cluster": [
{ "package": "ucx",
"type": "tarball",
"url": "https://github.com/openucx/ucx/releases/download/v{{ ucx_version }}/ucx-{{ ucx_version }}.tar.gz"
},
{"package": "gcc-c++", "type": "rpm", "repo_name": "appstream"}
]
}
}
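As with openmpi.json, the url is templated on the version variable; with a hypothetical ucx_version of 1.16.0 it would resolve to https://github.com/openucx/ucx/releases/download/v1.16.0/ucx-1.16.0.tar.gz.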
16 changes: 14 additions & 2 deletions input/omnia_config.yml
@@ -26,10 +26,22 @@ ansible_config_file_path: "/etc/ansible"

# -----------------------------SLURM------------------------------------------------

# Password used for Slurm database.
# Username and password used for Slurm database.
# The length of the password should be at least 8 characters.
# The password must not contain -, \, ', or "
mariadb_password: "password"
slurm_db_username: root
slurm_db_password: ""

# Host and port of the Slurm database.
# If no database is installed on the given node, one will be created.
# Defaults to the slurmdbd host if no host is given.
# Defaults to 3306 if no port is given.
slurm_db_host:
slurm_db_port: 3306

# Type of database to be used by Slurm.
# Options are mysql or mariadb. Defaults to mariadb.
slurm_db_type: mariadb

# This variable accepts whether slurm installation is supported in configless mode or slurm in nfs
# Default value is "configless"
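For illustration, a hedged example of how the new variables might be filled in to reuse an existing MariaDB instance (the hostname and credentials below are placeholders, not values from this PR):

slurm_db_username: slurm
slurm_db_password: "Slurm#Pass1"     # at least 8 characters; avoid -, \, ', "
slurm_db_host: dbnode.example.com    # leave empty to default to the slurmdbd host
slurm_db_port: 3306
slurm_db_type: mariadb               # or mysql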
23 changes: 23 additions & 0 deletions scheduler/add_node.yml
@@ -0,0 +1,23 @@
---
- name: Add nodes to Slurm
hosts: slurm_control_node, slurm_node, login, slurm_dbd
any_errors_fatal: true
vars:
share_mounted_path: "{{ hostvars['localhost']['share_path'] | default('/home') }}" # from storage.yml
#TODO: nfs_client role here, slurm depends on a mandatory share path
pre_tasks:
- name: Include input project directory
ansible.builtin.import_role:
name: ../utils/roles/include_input_dir
run_once: true
delegate_to: localhost
- name: Include vars omnia_config.yml
ansible.builtin.include_vars:
file: "{{ input_project_dir }}/omnia_config.yml"
run_once: true
delegate_to: localhost
tasks:
- name: Add node
ansible.builtin.import_role:
name: slurm
tasks_from: add_node
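Because the play imports only the add_node tasks from the slurm role, new compute nodes can presumably be joined to an existing cluster without re-running the full scheduler deployment, e.g. ansible-playbook scheduler/add_node.yml -i <inventory> (shown for illustration; the inventory path is not specified in this PR).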
23 changes: 23 additions & 0 deletions scheduler/remove_node.yml
@@ -0,0 +1,23 @@
---
- name: Remove nodes from Slurm
hosts: slurm_control_node, slurm_node
any_errors_fatal: true
vars:
Contributor comment: we have a separate utility for removing nodes; we should handle this there.

share_mounted_path: "{{ hostvars['localhost']['share_path'] | default('/home') }}" # from storage.yml
#TODO: nfs_client role here, slurm depends on a mandatory share path
pre_tasks:
- name: Include input project directory
ansible.builtin.import_role:
name: ../utils/roles/include_input_dir
run_once: true
delegate_to: localhost
- name: Include vars omnia_config.yml
ansible.builtin.include_vars:
file: "{{ input_project_dir }}/omnia_config.yml"
run_once: true
delegate_to: localhost
tasks:
- name: Remove node
ansible.builtin.import_role:
name: slurm
tasks_from: rm_node
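Taken together with the defaults and handlers below, the role layout implied by these playbooks is roughly the following (a sketch inferred from the paths referenced in this PR, not a verbatim listing):

scheduler/roles/slurm/
  defaults/main.yml    # __slurm_default_config, __cgroup_default_config, __slurm_dbd_default_config
  handlers/main.yml    # restart/reload handlers for munge, slurmdbd, slurmctld, slurmd
  tasks/
    add_node.yml       # entry point for scheduler/add_node.yml
    rm_node.yml        # entry point for scheduler/remove_node.yml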
@@ -85,13 +85,9 @@
- ucx_dir_data.stat.exists
- ucx_cmd.rc == 0

- name: Construct the command to compile the openmpi when slurm support is true
when: slurm_support
ansible.builtin.include_tasks: openmpi_cmd_with_slurm.yml

- name: Construct the command to compile the openmpi when slurm support is false
when: not slurm_support
ansible.builtin.include_tasks: openmpi_cmd_without_slurm.yml
- name: Construct the openmpi compile command with ucx and slurm
ansible.builtin.set_fact:
openmpi_compile_cmd: "./configure --prefix={{ omnia_share_path }}/{{ benchmarks_dir_openmpi }} --enable-mpi1-compatibility --enable-prte-prefix-by-default {{ '--with-slurm=yes' if slurm_support else '--with-slurm=no' }}{{ ' --with-ucx='+ucx_dir_data.stat.path+' ' if ucx_installed else ' ' }}CC=clang CXX=clang++ 2>&1 | tee config.out"

- name: Create a build directory inside openmpi folder
ansible.builtin.file:
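For clarity, with slurm_support and ucx_installed both true the consolidated fact renders to roughly the following (the share path and UCX directory are placeholders):

./configure --prefix=/omnia_share/openmpi_benchmarks --enable-mpi1-compatibility --enable-prte-prefix-by-default --with-slurm=yes --with-ucx=/omnia_share/ucx CC=clang CXX=clang++ 2>&1 | tee config.out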

This file was deleted.

This file was deleted.

37 changes: 37 additions & 0 deletions scheduler/roles/slurm/defaults/main.yml
@@ -0,0 +1,37 @@
---
# defaults file for slurm
mpi: {}
cgroup: {}
__cgroup_default_config:
CgroupPlugin: autodetect
ConstrainCores: true
ConstrainDevices: true
ConstrainRAMSpace: true
ConstrainSwapSpace: true
__slurm_default_config:
SlurmUser: "{{ slurm_user }}"
SlurmctldPort: 6817
SlurmdPort: 6818
SrunPortRange: "60001-63000"
StateSaveLocation: "/var/spool/state"
SlurmdSpoolDir: "/var/spool/slurmd"
ReturnToService: 2
SchedulerType: sched/backfill
MpiDefault: None
ProctrackType: proctrack/cgroup
SelectType: select/linear
SlurmctldLogFile: /var/log/slurmctld.log
SlurmdLogFile: /var/log/slurmd.log
SlurmctldPidFile: /var/run/slurmctld.pid
SlurmdPidFile: /var/run/slurmd.pid
AuthType: auth/munge
CryptoType: crypto/munge
SlurmctldTimeout: 120
SlurmdTimeout: 300
__slurm_dbd_default_config:
AuthType: auth/munge
LogFile: /var/log/slurmdbd.log
PidFile: /var/run/slurmdbd.pid
SlurmUser: "{{ slurm_user }}"
StorageType: accounting_storage/mysql
StorageLoc: slurm_acct_db
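Since these are role defaults they sit at the lowest variable precedence, and the empty mpi and cgroup dictionaries suggest that user-supplied settings are merged over the corresponding __*_default_config values (the merge itself is not part of this diff, so that is an assumption). A hedged override sketch using keys that do appear above:

cgroup:
  ConstrainSwapSpace: false
  ConstrainDevices: false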
70 changes: 70 additions & 0 deletions scheduler/roles/slurm/handlers/main.yml
@@ -0,0 +1,70 @@
---
# handlers file for slurm
- name: Restart chrony
ansible.builtin.systemd_service:
name: "{{ 'chronyd' if ansible_os_family == 'RedHat' else 'chrony' }}"
state: restarted
enabled: true

- name: Restart munge
ansible.builtin.systemd_service:
name: munge
state: restarted

- name: Restart mysqld
ansible.builtin.systemd_service:
name: mysqld
state: restarted
run_once: true
when: restart_slurm_services
delegate_to: "{{ slurm_db_host }}"

- name: Restart mariadb
ansible.builtin.systemd_service:
name: mariadb
state: restarted
run_once: true
when: restart_slurm_services
delegate_to: "{{ slurm_db_host }}"

- name: Reload slurmdbd
ansible.builtin.systemd_service:
name: slurmdbd
state: reloaded
daemon_reload: true
enabled: true
when: restart_slurm_services and ('slurm_dbd' in group_names)

- name: Restart slurmdbd
ansible.builtin.systemd_service:
name: slurmdbd
state: restarted
when: restart_slurm_services and ('slurm_dbd' in group_names)

- name: Reload slurmctld
ansible.builtin.systemd_service:
name: slurmctld
state: reloaded
daemon_reload: true
enabled: true
when: restart_slurm_services and ('slurm_control_node' in group_names)

- name: Restart slurmctld
ansible.builtin.systemd_service:
name: slurmctld
state: restarted
when: restart_slurm_services and ('slurm_control_node' in group_names)

- name: Reload slurmd
ansible.builtin.systemd_service:
name: slurmd
state: reloaded
daemon_reload: true
enabled: true
when: restart_slurm_services and ('slurm_node' in group_names)

- name: Restart slurmd
ansible.builtin.systemd_service:
name: slurmd
state: restarted
when: restart_slurm_services and ('slurm_node' in group_names)
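All of these handlers are gated on restart_slurm_services and on inventory group membership, so a single notify fans out only to the hosts that actually run each daemon. A minimal sketch of a task that would trigger them; the handler names are taken from above, but the template source and destination paths are assumptions, not part of this PR:

- name: Write slurm.conf
  ansible.builtin.template:
    src: slurm.conf.j2           # assumed template name
    dest: /etc/slurm/slurm.conf  # assumed config path
    mode: "0644"
  notify:
    - Restart slurmctld
    - Restart slurmd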