-
Notifications
You must be signed in to change notification settings - Fork 126
Lustre mount via Ansible for SMHP Slurm LCS #682
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Testing once before ready to merge |
Seems like there's a difference in the way OpenZFS (NFS) and Lustre interact with the |
I see the entry in |
Weird, I can mount it when run directly on the instance, but it doesn't seem to mount as a LifeCycleScript (with and without the |
It's mounted on access since it is a systemd automount |
Yeah, but when I try to mimic access in the LCS (egs. |
Check the systemd daemon status |
Fixed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…oon); Adding -y flag to ssh-keygen (Issue #676)
* Fixing Race Conditions reported in #674 * Fixing lifecycle_script.py by removing ansible call (PR #682 coming soon); Adding -y flag to ssh-keygen (Issue #676) * Converting 1/0 int --> str based on class def * Adding --allow-downgrades to install_docker.sh to account for pinned version (< what's available in AMI). NVIDIA/nvidia-container-toolkit#1093 * Reverting back -y for ssh-keygen, moving to gracious error handling + continue instead * Adding or to previous error handling
* Adding ssh keys to additional (OZFS at `/home`) file system (#700) * Testing ssh with ozfs LCS * Replacing ********* with localhost in OZFS mount script (#696) (#699) * Update gen-keypair-ubuntu.sh * Update nccl-tests.Dockerfile * [feat]: Add describe alarm permissions in the execution role for Rolling Update Autorollback. (#698) Co-authored-by: Vinay Devadiga <[email protected]> * fsdp k8s yaml to use c10d rdzv backend instead of etcd, updated readm… (#701) * fsdp k8s yaml to use c10d rdzv backend instead of etcd, updated readme, and cleaned yaml arguments * updating PR from feedback: using generate-sbatch-training-files, and ReadMe updated * small readme fix in envsubst line * model file name change, generate training file pytorchjob name update * Update 3.test_cases/pytorch/FSDP/kubernetes/README.md --------- Co-authored-by: mhuguesaws <[email protected]> * Fixing Race Conditions reported in #674 (#703) * Fixing Race Conditions reported in #674 * Fixing lifecycle_script.py by removing ansible call (PR #682 coming soon); Adding -y flag to ssh-keygen (Issue #676) * Converting 1/0 int --> str based on class def * Adding --allow-downgrades to install_docker.sh to account for pinned version (< what's available in AMI). NVIDIA/nvidia-container-toolkit#1093 * Reverting back -y for ssh-keygen, moving to gracious error handling + continue instead * Adding or to previous error handling --------- Co-authored-by: mhuguesaws <[email protected]> Co-authored-by: Vinay Devadiga <[email protected]> Co-authored-by: Vinay Devadiga <[email protected]> Co-authored-by: mvinci12 <[email protected]>
Tested with:
Works as expected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good to me. Reviewed files under 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config
. Suggested few changes for possible improvements.
1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/mount_fsx.sh
Outdated
Show resolved
Hide resolved
1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/mount_fsx.sh
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set -eux
if we apply the aforementioned change.
1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/fsx_ubuntu.sh
Outdated
Show resolved
Hide resolved
1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/fsx_ubuntu.sh
Outdated
Show resolved
Hide resolved
1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/gen-keypair-ubuntu.sh
Outdated
Show resolved
Hide resolved
1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/gen-keypair-ubuntu.sh
Outdated
Show resolved
Hide resolved
1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/gen-keypair-ubuntu.sh
Outdated
Show resolved
Hide resolved
1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/gen-keypair-ubuntu.sh
Outdated
Show resolved
Hide resolved
1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/gen-keypair-ubuntu.sh
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rebase the branch to only show the code changes. There are unrelated file change in this PR.
Still waiting on rebase to resolve conflicts. |
…fig/mount_fsx.sh Co-authored-by: Keita Watanabe <[email protected]>
…fig/mount_fsx.sh Co-authored-by: Keita Watanabe <[email protected]>
…fig/utils/fsx_ubuntu.sh Co-authored-by: Keita Watanabe <[email protected]>
…fig/utils/fsx_ubuntu.sh Co-authored-by: Keita Watanabe <[email protected]>
…fig/utils/gen-keypair-ubuntu.sh Co-authored-by: Keita Watanabe <[email protected]>
…fig/utils/gen-keypair-ubuntu.sh Co-authored-by: Keita Watanabe <[email protected]>
…fig/utils/gen-keypair-ubuntu.sh Co-authored-by: Keita Watanabe <[email protected]>
…fig/utils/gen-keypair-ubuntu.sh Co-authored-by: Keita Watanabe <[email protected]>
…fig/utils/gen-keypair-ubuntu.sh Co-authored-by: Keita Watanabe <[email protected]>
Conflicts with main resolved |
@@ -160,6 +160,8 @@ def main(args): | |||
params = ProvisioningParameters(args.provisioning_parameters) | |||
resource_config = ResourceConfig(args.resource_config) | |||
|
|||
ExecuteBashScript("./utils/install_ansible.sh").run() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this getting installed? need architecture diagram and readme
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please clarify, does ansible get installed on every node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Architecture diagram for what? Yes, ansible, the package, is installed on every node (hence localhost to mount in inventory)
@@ -10,9 +10,6 @@ FSX_OPENZFS_DNS_NAME="$1" | |||
OPENZFS_MOUNT_POINT="$2" | |||
NFS_VERSION=4.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be hardcoded? Can we add a comment why Version 4.2 is used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to pin versions. For everything.
{ | ||
apt-get update | ||
# apt-get install -y ansible=$ANSIBLE_VERSION | ||
apt-get install -y python3-pip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need retry on the network call, similar to what we have with other calls that was due to issues observed with u22.04 AMI. How many nodes has this script been tested on?
@@ -27,6 +27,7 @@ echo \ | |||
$(lsb_release -cs) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null | |||
apt-get -y -o DPkg::Lock::Timeout=120 update |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to add backoff on these apt-get calls? how many nodes has this been tested on?
Can you please clarify before merging, what advantages does moving to ansible bring over bash for HyperPod Lifecycle scripts? |
Issue #, if available: #642
Description of changes: Using Ansible to mount FSxL file system -- simplifying the LCS by getting rid of all the redundancies. Simple
ansible.posix.mount
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.