Skip to content

Lustre mount via Ansible for SMHP Slurm LCS #682

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

amanshanbhag
Copy link
Contributor

Issue #, if available: #642

Description of changes: Using Ansible to mount FSxL file system -- simplifying the LCS by getting rid of all the redundancies. Simple ansible.posix.mount

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@amanshanbhag
Copy link
Contributor Author

Testing once before ready to merge

@amanshanbhag amanshanbhag requested a review from mhuguesaws May 15, 2025 22:48
@amanshanbhag
Copy link
Contributor Author

Seems like there's a difference in the way OpenZFS (NFS) and Lustre interact with the x-systemd.automount option. The FSxL file system isn't mounted on boot, only when accessed. Debating whether to remove that option

@amanshanbhag
Copy link
Contributor Author

I see the entry in /etc/fstab, just not in df -Th...

@amanshanbhag
Copy link
Contributor Author

Weird, I can mount it when run directly on the instance, but it doesn't seem to mount as a LifeCycleScript (with and without the automount option)

@mhuguesaws
Copy link
Contributor

It's mounted on access since it is a systemd automount

@amanshanbhag
Copy link
Contributor Author

Yeah, but when I try to mimic access in the LCS (egs. ls -la /fsx/ubuntu), I still don't see it mounted. I understand that an automount fs (and fstab entry) is created prior, and only mounted on access

@mhuguesaws
Copy link
Contributor

Check the systemd daemon status
systemctl status fsx.automount

@amanshanbhag
Copy link
Contributor Author

Fixed.

Copy link
Contributor

@mhuguesaws mhuguesaws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mhuguesaws mhuguesaws self-assigned this May 22, 2025
amanshanbhag added a commit that referenced this pull request May 27, 2025
…oon); Adding -y flag to ssh-keygen (Issue #676)
amanshanbhag added a commit that referenced this pull request May 28, 2025
* Fixing Race Conditions reported in #674

* Fixing lifecycle_script.py by removing ansible call (PR #682 coming soon); Adding -y flag to ssh-keygen (Issue #676)

* Converting 1/0 int --> str based on class def

* Adding --allow-downgrades to install_docker.sh to account for pinned version (< what's available in AMI). NVIDIA/nvidia-container-toolkit#1093

* Reverting back -y for ssh-keygen, moving to gracious error handling + continue instead

* Adding or to previous error handling
amanshanbhag and others added 5 commits May 28, 2025 14:02
* Adding ssh keys to additional (OZFS at `/home`) file system (#700)

* Testing ssh with ozfs LCS

* Replacing ********* with localhost in OZFS mount script (#696) (#699)

* Update gen-keypair-ubuntu.sh

* Update nccl-tests.Dockerfile

* [feat]: Add describe alarm permissions in the execution role for Rolling Update Autorollback. (#698)

Co-authored-by: Vinay Devadiga <[email protected]>

* fsdp k8s yaml to use c10d rdzv backend instead of etcd, updated readm… (#701)

* fsdp k8s yaml to use c10d rdzv backend instead of etcd, updated readme, and cleaned yaml arguments

* updating PR from feedback: using generate-sbatch-training-files, and ReadMe updated

* small readme fix in envsubst line

* model file name change, generate training file pytorchjob name update

* Update 3.test_cases/pytorch/FSDP/kubernetes/README.md

---------

Co-authored-by: mhuguesaws <[email protected]>

* Fixing Race Conditions reported in #674 (#703)

* Fixing Race Conditions reported in #674

* Fixing lifecycle_script.py by removing ansible call (PR #682 coming soon); Adding -y flag to ssh-keygen (Issue #676)

* Converting 1/0 int --> str based on class def

* Adding --allow-downgrades to install_docker.sh to account for pinned version (< what's available in AMI). NVIDIA/nvidia-container-toolkit#1093

* Reverting back -y for ssh-keygen, moving to gracious error handling + continue instead

* Adding or to previous error handling

---------

Co-authored-by: mhuguesaws <[email protected]>
Co-authored-by: Vinay Devadiga <[email protected]>
Co-authored-by: Vinay Devadiga <[email protected]>
Co-authored-by: mvinci12 <[email protected]>
@amanshanbhag
Copy link
Contributor Author

Tested with:

  1. Cluster with Head Node + Worker Nodes + only FSxL
  2. Cluster with Login Node + Worker Nodes + FSxL + FSxOpenZFS

Works as expected.

Copy link
Contributor

@KeitaW KeitaW left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good to me. Reviewed files under 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config. Suggested few changes for possible improvements.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set -eux if we apply the aforementioned change.

Copy link
Contributor

@mhuguesaws mhuguesaws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rebase the branch to only show the code changes. There are unrelated file change in this PR.

@mhuguesaws
Copy link
Contributor

Still waiting on rebase to resolve conflicts.

amanshanbhag and others added 9 commits June 9, 2025 10:01
…fig/utils/fsx_ubuntu.sh

Co-authored-by: Keita Watanabe <[email protected]>
…fig/utils/fsx_ubuntu.sh

Co-authored-by: Keita Watanabe <[email protected]>
…fig/utils/gen-keypair-ubuntu.sh

Co-authored-by: Keita Watanabe <[email protected]>
…fig/utils/gen-keypair-ubuntu.sh

Co-authored-by: Keita Watanabe <[email protected]>
…fig/utils/gen-keypair-ubuntu.sh

Co-authored-by: Keita Watanabe <[email protected]>
…fig/utils/gen-keypair-ubuntu.sh

Co-authored-by: Keita Watanabe <[email protected]>
…fig/utils/gen-keypair-ubuntu.sh

Co-authored-by: Keita Watanabe <[email protected]>
@amanshanbhag
Copy link
Contributor Author

Conflicts with main resolved

@@ -160,6 +160,8 @@ def main(args):
params = ProvisioningParameters(args.provisioning_parameters)
resource_config = ResourceConfig(args.resource_config)

ExecuteBashScript("./utils/install_ansible.sh").run()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this getting installed? need architecture diagram and readme

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please clarify, does ansible get installed on every node?

Copy link
Contributor Author

@amanshanbhag amanshanbhag Jun 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Architecture diagram for what? Yes, ansible, the package, is installed on every node (hence localhost to mount in inventory)

@@ -10,9 +10,6 @@ FSX_OPENZFS_DNS_NAME="$1"
OPENZFS_MOUNT_POINT="$2"
NFS_VERSION=4.2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be hardcoded? Can we add a comment why Version 4.2 is used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to pin versions. For everything.

{
apt-get update
# apt-get install -y ansible=$ANSIBLE_VERSION
apt-get install -y python3-pip
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need retry on the network call, similar to what we have with other calls that was due to issues observed with u22.04 AMI. How many nodes has this script been tested on?

@@ -27,6 +27,7 @@ echo \
$(lsb_release -cs) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null
apt-get -y -o DPkg::Lock::Timeout=120 update
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to add backoff on these apt-get calls? how many nodes has this been tested on?

@nghtm
Copy link
Contributor

nghtm commented Jun 17, 2025

Can you please clarify before merging, what advantages does moving to ansible bring over bash for HyperPod Lifecycle scripts?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants