Lustre mount via Ansible for SMHP Slurm LCS #682

Open
wants to merge 20 commits into
base: main
Changes from 3 commits
Commits
8a73b80
Lustre mount via Ansible
amanshanbhag May 15, 2025
15cc3a8
Fixed automount bug for fsxl + ansible
amanshanbhag May 16, 2025
715c6e4
Ansible for systemd calls
amanshanbhag May 16, 2025
b6ca6d5
Resolving ozfs conflicts
amanshanbhag May 27, 2025
51cd4e9
Merge branch 'main' into lustre-ansible
amanshanbhag May 27, 2025
4918505
Syncing with latest pushes to repo
amanshanbhag May 28, 2025
7a37354
Merge main into branch for #682 (#704)
amanshanbhag May 28, 2025
97885b9
Merging from main into local branch
amanshanbhag May 28, 2025
0d2efa4
Merge changes from main
amanshanbhag May 28, 2025
0a1aeab
small fix for #705
amanshanbhag May 28, 2025
1003b6b
Update 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-con…
amanshanbhag Jun 9, 2025
357d413
Update 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-con…
amanshanbhag Jun 9, 2025
a2f0a8b
Update 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-con…
amanshanbhag Jun 9, 2025
00045a2
Update 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-con…
amanshanbhag Jun 9, 2025
5217c0b
Update 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-con…
amanshanbhag Jun 9, 2025
73146bc
Update 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-con…
amanshanbhag Jun 9, 2025
269efa6
Update 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-con…
amanshanbhag Jun 9, 2025
72fad2b
Update 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-con…
amanshanbhag Jun 9, 2025
36498ea
Update 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-con…
amanshanbhag Jun 9, 2025
ce3022b
Merge branch 'main' into lustre-ansible
amanshanbhag Jun 9, 2025
@@ -160,6 +160,8 @@ def main(args):
params = ProvisioningParameters(args.provisioning_parameters)
resource_config = ResourceConfig(args.resource_config)

ExecuteBashScript("./utils/install_ansible.sh").run()
Contributor

Where is this getting installed? We need an architecture diagram and a README.

Contributor

Please clarify: does Ansible get installed on every node?

Contributor Author

@amanshanbhag amanshanbhag Jun 17, 2025

Architecture diagram for what? Yes, the Ansible package is installed on every node, which is why the mount commands target localhost as the inventory host.
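To make the "every node targets itself" point concrete, here is a minimal sketch of a wrapper around the ad-hoc calls this PR uses. The function name run_local_module and the DRY_RUN variable are hypothetical and not part of the PR. With DRY_RUN=1 the command is only printed, so the wiring can be checked on a machine that does not have Ansible installed.

```shell
#!/bin/bash
# Hypothetical helper (not part of this PR): run a single Ansible module
# against localhost with privilege escalation (-b), the same pattern the
# lifecycle scripts use. Assumes Ansible is already installed on the node.
run_local_module() {
    local module="$1"
    local module_args="$2"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Dry run: print the command instead of executing it.
        printf 'ansible localhost -b -m %s -a "%s"\n' "$module" "$module_args"
    else
        ansible localhost -b -m "$module" -a "$module_args"
    fi
}
```

For example, DRY_RUN=1 run_local_module ansible.builtin.modprobe "name=lnet state=present" prints the same command that load_lnet_modules runs in the new mount script.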


fsx_dns_name, fsx_mountname = params.fsx_settings
if fsx_dns_name and fsx_mountname:
print(f"Mount fsx: {fsx_dns_name}. Mount point: {fsx_mountname}")
@@ -1,6 +1,6 @@
#!/bin/bash

# must be run a sudo
# must be run as sudo

set -x
set -e
@@ -10,123 +10,68 @@ FSX_DNS_NAME="$1"
FSX_MOUNTNAME="$2"
MOUNT_POINT="$3"

is_mounted() {
mountpoint -q "$1"
return $?
# Function for error handling
handle_error()
{
local exit_code=$?
echo "Error occurred in command: $BASH_COMMAND"
echo "Exit code: $exit_code"
exit $exit_code
}

check_already_mounted() {
# Check if FSx is already mounted to $MOUNT_POINT
if is_mounted $MOUNT_POINT; then
if grep -qs "$FSX_MOUNTNAME $MOUNT_POINT lustre" /proc/mounts; then
echo "FSx Lustre already mounted to $MOUNT_POINT. Exiting."
exit 0
else
echo "$MOUNT_POINT is mounted, but not to mountname: $FSX_MOUNTNAME from provisioning_parameters.json. Exiting."
exit 1
fi
fi
}
trap handle_error ERR

is_fsx_reachable() {
if lctl ping "$FSX_DNS_NAME"; then
echo "FSx is reachable"
else
echo "FSx is not reachable, Trying to mount system anyway"
fi
# DEBUG: Verify parameters are set
verify_parameters()
{
if [ -z "$FSX_DNS_NAME" ] || [ -z "$FSX_MOUNTNAME" ] || [ -z "$MOUNT_POINT" ]; then
echo "Usage: $0 <fsx_dns_name> <fsx_mountname> <mount_point>"
exit 1
fi
}

add_to_fstab() {
# Add FSx to /etc/fstab
echo "$FSX_DNS_NAME@tcp:/$FSX_MOUNTNAME $MOUNT_POINT lustre defaults,noatime,flock,_netdev 0 0" | tee -a /etc/fstab
# Print Lustre client version
print_lustre_version()
{
echo "Lustre client version:"
modinfo lustre | grep 'version:' | head -n 1 | awk '{print $2}'
}

mount_fs() {
if [[ ! -d $MOUNT_POINT ]]; then
mkdir -p $MOUNT_POINT
chmod 644 $MOUNT_POINT
fi

if mount -t lustre -o noatime,flock "$FSX_DNS_NAME"@tcp:/"$FSX_MOUNTNAME" "$MOUNT_POINT"; then
if ! is_mounted $MOUNT_POINT ;then
echo "Mounting FSx to $MOUNT_POINT directory successful, but mountpoint was not detected. Exiting."
exit 1
fi
else
echo "FAILED to mount, FSX to $MOUNT_POINT directory. Exiting."
exit 1
fi
# Load lnet modules
load_lnet_modules()
{
ansible localhost -b -m ansible.builtin.modprobe -a "name=lnet state=present"
}

# Mount the FSx Lustre file system using Ansible
mount_fs()
{
ansible localhost -b -m ansible.posix.mount -a "path=$MOUNT_POINT src=$FSX_DNS_NAME@tcp:/$FSX_MOUNTNAME fstype=lustre opts=noatime,flock,_netdev,x-systemd.automount,x-systemd.requires=network-online.target dump=0 passno=0 state=mounted"

load_lnet_modules() {
modprobe -v lnet
# Trigger automount by accessing the filesystem
echo "Triggering automount by accessing $MOUNT_POINT..."
ls -la $MOUNT_POINT >/dev/null 2>&1 || true && ansible localhost -m ansible.builtin.file -a "path=$MOUNT_POINT/test_file state=touch" && ansible localhost -m ansible.builtin.file -a "path=$MOUNT_POINT/test_file state=absent"
}

# create a systemd service to check mount periodically and remount FSx if necessary
# To stop the service, run:
# `systemctl stop check_mount.service`
# To disable the service, run:
# `systemctl disable check_mount.service`
install_remount_service() {

if [[ ! -d /opt/ml/scripts ]]; then
mkdir -p /opt/ml/scripts
chmod 644 /opt/ml/scripts
echo "Created dir /opt/ml/scripts"
fi

CHECK_MOUNT_FILE=/opt/ml/scripts/check_mount_$FSX_MOUNTNAME.sh

cat > $CHECK_MOUNT_FILE << EOF
#!/bin/bash
MOUNT_POINT=$MOUNT_POINT
if ! grep -qs "$MOUNT_POINT" /proc/mounts; then
mount -t lustre -o noatime,flock "$FSX_DNS_NAME"@tcp:/"$FSX_MOUNTNAME" "$MOUNT_POINT"
echo "Mounted FSx to $MOUNT_POINT"
else
echo "FSx Lustre already mounted to $MOUNT_POINT. Stopping services check_fsx_mount_$FSX_MOUNTNAME.timer and check_fsx_mount_$FSX_MOUNTNAME.service"
systemctl stop check_fsx_mount_$FSX_MOUNTNAME.timer
fi
EOF

chmod +x $CHECK_MOUNT_FILE

cat > /etc/systemd/system/check_fsx_mount_$FSX_MOUNTNAME.service << EOF
[Unit]
Description=Check and remount FSx Lustre filesystems if necessary

[Service]
ExecStart=$CHECK_MOUNT_FILE
EOF

cat > /etc/systemd/system/check_fsx_mount_$FSX_MOUNTNAME.timer << EOF
[Unit]
Description=Run check_fsx_mount_$FSX_MOUNTNAME.service every minute

[Timer]
OnBootSec=1min
OnUnitActiveSec=1min

[Install]
WantedBy=timers.target
EOF

systemctl daemon-reload
systemctl enable --now check_fsx_mount_$FSX_MOUNTNAME.timer
restart_daemon()
{
ansible localhost -b -m ansible.builtin.systemd -a "daemon_reload=yes"
ansible localhost -b -m ansible.builtin.systemd -a "name=remote-fs.target state=restarted"
# Readable status check
echo "Check status of fsx automount service..."
systemctl status fsx.automount
}

main() {
echo "Mount_fsx called fsx_dns_name: $FSX_DNS_NAME, fsx_mountname: $FSX_MOUNTNAME"
echo "Using mount_point: $MOUNT_POINT"
load_lnet_modules
check_already_mounted
is_fsx_reachable
add_to_fstab
mount_fs
install_remount_service
echo "FSx Lustre mounted successfully to $MOUNT_POINT"
main()
{
verify_parameters
echo "Mount_fsx called with fsx_dns_name: $FSX_DNS_NAME, fsx_mountname: $FSX_MOUNTNAME"
echo "Using mount_point: $MOUNT_POINT"
echo "LUSTRE CLIENT CONFIGURATION $(print_lustre_version)"
load_lnet_modules
mount_fs
restart_daemon
echo "FSx Lustre mounted successfully to $MOUNT_POINT"
}

main "$@"

main "$@"
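The new mount_fs passes one long argument string to ansible.posix.mount. With state=mounted, that module both mounts the file system and records the entry in /etc/fstab, which is why the separate add_to_fstab step disappears from the script. The sketch below only assembles that argument string so it can be inspected or unit-tested; the function name lustre_mount_args and the example values are hypothetical, while the options are copied from the diff.

```shell
#!/bin/bash
# Hypothetical helper (not part of this PR): build the module argument
# string that mount_fs hands to ansible.posix.mount.
lustre_mount_args() {
    local dns_name="$1"
    local mount_name="$2"
    local mount_point="$3"
    # Mount options taken verbatim from the PR's mount_fs function.
    local opts="noatime,flock,_netdev,x-systemd.automount,x-systemd.requires=network-online.target"
    printf 'path=%s src=%s@tcp:/%s fstype=lustre opts=%s dump=0 passno=0 state=mounted\n' \
        "$mount_point" "$dns_name" "$mount_name" "$opts"
}
```

A call such as lustre_mount_args "$FSX_DNS_NAME" "$FSX_MOUNTNAME" "$MOUNT_POINT" yields the value that would follow -a in the ad-hoc command.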
Contributor

Use set -eux if we apply the aforementioned change.
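For reference, set -eux combines three shell options: -e exits on the first failing command, -u treats expansion of an unset variable as an error, and -x traces each command to stderr. The sketch below shows the first two effects in a child shell so that the calling shell is not affected; run_strict is a hypothetical name used only for this illustration.

```shell
#!/bin/bash
# Hypothetical demo helper: run a snippet in a child bash with set -eux
# and return the child's exit status. The -x trace goes to stderr, which
# is discarded here to keep the output clean.
run_strict() {
    bash -c "set -eux; $1" 2>/dev/null
}

# -e: the child stops at the failing command, so the echo never runs.
# run_strict 'false; echo unreachable'
# -u: expanding an unset variable is an error instead of an empty string.
# run_strict 'echo "$SOME_UNSET_VARIABLE"'
```

Both commented calls return a non-zero status, whereas without -e and -u the same snippets would end with status 0.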

@@ -10,9 +10,6 @@ FSX_OPENZFS_DNS_NAME="$1"
OPENZFS_MOUNT_POINT="$2"
NFS_VERSION=4.2
Contributor

Does this need to be hardcoded? Can we add a comment explaining why version 4.2 is used?

Contributor Author

We need to pin versions. For everything.
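As an illustration of pinning everything, note that the diff pins the ansible package through ANSIBLE_VERSION but installs the ansible.posix collection without a version. A minimal sketch of pinning both follows. The helper names are hypothetical, and the ansible.posix version shown is a placeholder, not a value taken from or tested against this PR. The name:==version form is the ansible-galaxy syntax for requesting an exact collection version.

```shell
#!/bin/bash
# Pins collected in one place. ANSIBLE_VERSION matches the PR; the
# ansible.posix version is a hypothetical placeholder, not a tested value.
ANSIBLE_VERSION="10.7.0"
ANSIBLE_POSIX_VERSION="1.5.4"   # placeholder, choose and test your own

# Hypothetical helpers that format exact-version requirement strings.
pip_requirement()    { printf '%s==%s\n' "$1" "$2"; }
galaxy_requirement() { printf '%s:==%s\n' "$1" "$2"; }

# Print the pinned install commands instead of running them.
print_pinned_install_commands() {
    echo "python3 -m pip install \"$(pip_requirement ansible "$ANSIBLE_VERSION")\""
    echo "ansible-galaxy collection install \"$(galaxy_requirement ansible.posix "$ANSIBLE_POSIX_VERSION")\""
}
```

Keeping the pins as variables at the top of the script, as the PR already does for ANSIBLE_VERSION, leaves a single place to update and to document why a given version was chosen.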


# Ansible Version
ANSIBLE_VERSION="10.7.0"

# Function for error handling
handle_error()
{
@@ -33,16 +30,6 @@ verify_parameters()
fi
}

# Install Ansible and collections: Move to higher LCS once others start using Ansible too.
install_ansible()
{
apt-get update
# apt-get install -y ansible=$ANSIBLE_VERSION
apt-get install -y python3-pip
python3 -m pip install "ansible==${ANSIBLE_VERSION}"
ansible-galaxy collection install ansible.posix
}

# Install NFS Client based on OS
install_nfs_client()
{
@@ -66,7 +53,6 @@ main()
echo "Mount_fsx_openzfs called with fsx_openzfs_dns_name: $FSX_OPENZFS_DNS_NAME"
echo "Using openzfs_mount_point: $OPENZFS_MOUNT_POINT"
verify_parameters
install_ansible
install_nfs_client
mount_fs
echo "FSx OpenZFS mounted successfully to $OPENZFS_MOUNT_POINT"
@@ -0,0 +1,28 @@
#!/bin/bash

set -ex

# Ansible Version
ANSIBLE_VERSION="10.7.0"

# Install Ansible and collections: Move to higher LCS once others start using Ansible too.
install_ansible()
{
apt-get update
# apt-get install -y ansible=$ANSIBLE_VERSION
apt-get install -y python3-pip
Contributor

Do we need a retry on the network calls, similar to what we added for other calls because of issues observed with the Ubuntu 22.04 AMI? How many nodes has this script been tested on?
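One way the retry raised here could look is sketched below. The retry function is hypothetical and is not taken from this PR or from the repository's existing retry logic; it reruns a command until it succeeds or a maximum number of attempts is reached, sleeping a fixed delay between attempts.

```shell
#!/bin/bash
# Hypothetical retry wrapper (not part of this PR).
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
    local max_attempts="$1"
    local delay="$2"
    shift 2
    local attempt=1
    # The command runs as the loop condition, so a failure here does not
    # trigger an exit under set -e.
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Command failed after $attempt attempts: $*" >&2
            return 1
        fi
        echo "Attempt $attempt failed; retrying in ${delay}s..." >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}
```

Inside install_ansible the network-bound steps could then be wrapped, for example retry 5 10 apt-get update or retry 5 10 python3 -m pip install "ansible==${ANSIBLE_VERSION}". The attempt count and delay shown are illustrative values, not tuned ones.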

python3 -m pip install "ansible==${ANSIBLE_VERSION}"
ansible-galaxy collection install ansible.posix

# Verify ansible installation
echo "Ansible version:"
ansible --version
}

main()
{
echo "Installing Ansible..."
install_ansible
}

main