Copy Offload requests are failing as we scale the number of processes and compute nodes #196

Closed
mcfadden8 opened this issue Aug 19, 2024 · 5 comments

mcfadden8 commented Aug 19, 2024

We are seeing the following failure from the nnf-dm API when running a 4-node/16-process allocation (further detailed discussion here: https://llnl.slack.com/archives/C020U81E05U/p1722894207606109 and here: https://llnl.slack.com/archives/C020U81E05U/p1722967213939429):

flux run '--requires=-host:rzadams[1033-1048]' -q pdev -t 1h -N4 -n16 --setattr=dw="#DW jobdw type=lustre capacity=1440GiB name=scrcache requires=copy-offload" ./test_api -t 5 -s 9GB
flux-job: fcJbo2NAK3m 

Shortly after the test starts and the copy offload begins, we see the following failure:

2024-08-05 14:37:22:937 AXL ERROR:1971777 rzadams1068: @ nnfdm_stat:96 NNFDM Offload Status UNSUCCESSFUL: 0
Offload Command Status: {3/STATE_COMPLETED}, {0/STATUS_INVALID}
    Offload Command Status:
      Command:
      Progress: 0%
      ElapsedTime:
      LastMessage:
      LastMessageTime:
    Offload StartTime:
    Offload EndTime:
    Offload Message: internal error: could not determine source type: could not stat path ('/mnt/nnf/df5e1995-c0f6-4b56-8e8b-a00d7a70807d-0/martymcf/scr.defjobid/scr.dataset.1/rank_9.ckpt'): command: mpirun --allow-run-as-root -n 1 --hostfile /tmp/nnf-dm-42p79/hostfile -- setpriv --euid 54987 --egid 54987 --clear-groups stat --cached never -c '%F' /mnt/nnf/df5e1995-c0f6-4b56-8e8b-a00d7a70807d-0/martymcf/scr.defjobid/scr.dataset.1/rank_9.ckpt - stderr: kex_exchange_identification: Connection closed by remote host
Connection closed by 10.85.232.28 port 2222
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.
*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
@mcfadden8

From @bdevcich on 8/7:
Good news. I can replicate this on our end.

@bdevcich bdevcich moved this from 📋 Open to 🏗 In progress in Issues Dashboard Aug 23, 2024
@bdevcich bdevcich self-assigned this Aug 23, 2024
@bdevcich

We've been able to reproduce this on our end just using YAML to create the data movement requests. So it has nothing to do with the nnf-dm daemon itself.

Going one step further, we can also reproduce this by skipping data movement altogether and using mpirun to run sleep commands from the nnf-dm-controller pod (running on a k8s control plane node) against an nnf-dm-worker pod (running on a rabbit node):

for i in {1..512}; do 
	mpirun --allow-run-as-root -n 1 --hostfile /tmp/blake sleep 10 &
done
$ cat /tmp/blake
10-42-3-193.dm.nnf-dm-system slots=2 max_slots=20

If we run this using a hostfile that contains localhost (as is the case for gfs2/xfs data movement), we don't see this problem, presumably because ssh is not involved.
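For comparison, the hostfile in the localhost case would look something like this (illustrative only; the slot counts here are just an example):

localhost slots=2 max_slots=20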

mpirun uses ssh under the hood, and the ssh configuration on the nnf-dm-worker pods running on the rabbit nodes has MaxStartups set to the default of 10:30:100.

From what I understand, this means that after 10 unauthenticated connections, sshd starts dropping 30% of new connections, and after 100 unauthenticated connections it drops every new one. Each ssh connection counts as unauthenticated until authentication completes, so even though we're using ssh keys, this still applies. Bumping the value up to 512:-1:512 has made the problem go away so far in my testing.
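For reference, this boils down to a single directive in the workers' sshd_config; a sketch using the value from the testing above:

# MaxStartups start:rate:full -- the default 10:30:100 means sshd starts
# dropping new unauthenticated connections with 30% probability once 10 are
# pending, and drops all of them once 100 are pending.
MaxStartups 512:-1:512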

To fix this, we need to change the nnf-mfu image used by the nnf-dm-worker pods. The image used by the nnf-dm-workers can be changed via gitops, so this fix might not require any code change.

That being said, I do have some improvements coming in the NnfDataMovementProfile (like being able to change the command used for stat).


bdevcich commented Aug 23, 2024

The fix to increase the number of ssh connections is here:

The ability to alter the stat command via NnfDataMovementProfiles is being added here:

@bdevcich

I've updated the nnf-mfu image to use 4096 for MaxStartups. This image can be used by updating the gitops configuration to pull the new image tag, 0.0.0.21-8221. The change should look similar to this:

diff --git a/environments/htx-1/nnf-dm/nnf-dm.yaml b/environments/htx-1/nnf-dm/nnf-dm.yaml
index 0d184e4..09ca3df 100644
--- a/environments/htx-1/nnf-dm/nnf-dm.yaml
+++ b/environments/htx-1/nnf-dm/nnf-dm.yaml
@@ -560,7 +560,7 @@ spec:
         - -De
         command:
         - /usr/sbin/sshd
-        image: ghcr.io/nearnodeflash/nnf-mfu:0.1.1
+        image: ghcr.io/nearnodeflash/nnf-mfu:0.0.0.21-8221
         name: worker
         securityContext:
           capabilities:

Once changed, this will cause the nnf-dm-worker-* pods to restart, but no upgrade is necessary. I've tested this on top of the latest NNF release (v0.1.6) and it should resolve the problem in the short term. I believe more tweaking will be necessary in the long term.
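As a sanity check once the gitops change has synced, something like the following confirms the worker pods came back with the new image (the nnf-dm-system namespace is an assumption about the deployment; the container name "worker" comes from the manifest above):

kubectl -n nnf-dm-system get pods | grep nnf-dm-worker
kubectl -n nnf-dm-system get pod <one-of-the-worker-pods> \
  -o jsonpath='{.spec.containers[?(@.name=="worker")].image}'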


bdevcich commented Sep 5, 2024

Additionally, the entire sshd_config is now configurable via the nnf-dm-worker-config ConfigMap in Kubernetes; users can increase the value of MaxStartups by editing that ConfigMap: https://nearnodeflash.github.io/dev/guides/data-movement/readme/#sshd-configuration-for-data-movement-workers

NearNodeFlash/nnf-dm#208
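For example, a minimal way to make that edit (the namespace is an assumption about the deployment; adjust to wherever the nnf-dm-worker pods run):

kubectl -n nnf-dm-system edit configmap nnf-dm-worker-config
# bump MaxStartups in the embedded sshd_config; the nnf-dm-worker pods may
# need a restart to pick up the new value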

@bdevcich bdevcich closed this as completed Sep 5, 2024
@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Closed in Issues Dashboard Sep 5, 2024