Copy Offload requests are failing as we scale the number of processes and compute nodes #196

Closed
mcfadden8 opened this issue Aug 19, 2024 · 5 comments

mcfadden8 commented Aug 19, 2024

We are seeing the following failure from the nnf-dm API when running a 4-node/16-process allocation (further detailed discussion here: https://llnl.slack.com/archives/C020U81E05U/p1722894207606109 and here: https://llnl.slack.com/archives/C020U81E05U/p1722967213939429):

flux run '--requires=-host:rzadams[1033-1048]' -q pdev -t 1h -N4 -n16 --setattr=dw="#DW jobdw type=lustre capacity=1440GiB name=scrcache requires=copy-offload" ./test_api -t 5 -s 9GB
flux-job: fcJbo2NAK3m 

Shortly after the test starts and the copy offload begins, we see the following failure:

2024-08-05 14:37:22:937 AXL ERROR:1971777 rzadams1068: @ nnfdm_stat:96 NNFDM Offload Status UNSUCCESSFUL: 0
Offload Command Status: {3/STATE_COMPLETED}, {0/STATUS_INVALID}
    Offload Command Status:
      Command:
      Progress: 0%
      ElapsedTime:
      LastMessage:
      LastMessageTime:
    Offload StartTime:
    Offload EndTime:
    Offload Message: internal error: could not determine source type: could not stat path ('/mnt/nnf/df5e1995-c0f6-4b56-8e8b-a00d7a70807d-0/martymcf/scr.defjobid/scr.dataset.1/rank_9.ckpt'): command: mpirun --allow-run-as-root -n 1 --hostfile /tmp/nnf-dm-42p79/hostfile -- setpriv --euid 54987 --egid 54987 --clear-groups stat --cached never -c '%F' /mnt/nnf/df5e1995-c0f6-4b56-8e8b-a00d7a70807d-0/martymcf/scr.defjobid/scr.dataset.1/rank_9.ckpt - stderr: kex_exchange_identification: Connection closed by remote host
Connection closed by 10.85.232.28 port 2222
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.
*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
@mcfadden8

From @bdevcich on 8/7:
Good news. I can replicate this on our end.

@bdevcich bdevcich moved this from 📋 Open to 🏗 In progress in Issues Dashboard Aug 23, 2024
@bdevcich bdevcich self-assigned this Aug 23, 2024
@bdevcich

We've been able to reproduce this on our end just using YAML to create the data movement requests. So it has nothing to do with the nnf-dm daemon itself.

Going one step further, we can also reproduce this by skipping data movement altogether and using mpirun to run sleep commands from the nnf-dm-controller pod (running on a k8s control plane node) against an nnf-dm-worker pod (running on a rabbit node):

for i in {1..512}; do 
	mpirun --allow-run-as-root -n 1 --hostfile /tmp/blake sleep 10 &
done
$ cat /tmp/blake
10-42-3-193.dm.nnf-dm-system slots=2 max_slots=20

If we run this using a hostfile that contains localhost (as is the case for gfs2/xfs data movement), we don't see this problem, presumably because ssh is not involved.
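For comparison, the hostfile in the localhost case would look something like this (illustrative only; the slot counts here are just an example):

localhost slots=2 max_slots=20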

mpirun uses ssh under the hood, and the ssh configuration on the nnf-dm-worker pods running on the rabbit nodes has MaxStartups set to the default of 10:30:100.

From what I understand, this means that after 10 unauthenticated connections, sshd starts dropping 30% of new connections, and after 100 unauthenticated connections it drops every new one. Each ssh connection counts as unauthenticated until authentication completes, so even though we're using ssh keys, this still applies. Bumping the value up to 512:-1:512 has made the problem go away so far in my testing.
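For reference, this boils down to a single directive in the workers' sshd_config; a sketch using the value from the testing above:

# MaxStartups start:rate:full -- the default 10:30:100 means sshd starts
# dropping new unauthenticated connections with 30% probability once 10 are
# pending, and drops all of them once 100 are pending.
MaxStartups 512:-1:512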

To fix this, we need to change the nnf-mfu image used by the nnf-dm-worker pods. The image used by the nnf-dm-workers can be changed via gitops, so this fix might not require any code change.

That being said, I do have some improvements coming in the NnfDataMovementProfile (like being able to change the command used for stat).


bdevcich commented Aug 23, 2024

The fix to increase the number of ssh connections is here:

The ability to alter the stat command via NnfDataMovementProfiles is being added here:

@bdevcich

I've updated the nnf-mfu image to use 4096 for MaxStartups. This image can be used by updating the gitops configuration to pull the new image tag, 0.0.0.21-8221. The change should look similar to this:

diff --git a/environments/htx-1/nnf-dm/nnf-dm.yaml b/environments/htx-1/nnf-dm/nnf-dm.yaml
index 0d184e4..09ca3df 100644
--- a/environments/htx-1/nnf-dm/nnf-dm.yaml
+++ b/environments/htx-1/nnf-dm/nnf-dm.yaml
@@ -560,7 +560,7 @@ spec:
         - -De
         command:
         - /usr/sbin/sshd
-        image: ghcr.io/nearnodeflash/nnf-mfu:0.1.1
+        image: ghcr.io/nearnodeflash/nnf-mfu:0.0.0.21-8221
         name: worker
         securityContext:
           capabilities:

Once changed, this will cause the nnf-dm-worker-* pods to restart, but no upgrade is necessary. I've tested this on top of the latest NNF release (v0.1.6) and it should resolve the problem in the short term. I believe more tweaking will be necessary in the long term.
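As a sanity check once the gitops change has synced, something like the following confirms the worker pods came back with the new image (the nnf-dm-system namespace is an assumption about the deployment; the container name "worker" comes from the manifest above):

kubectl -n nnf-dm-system get pods | grep nnf-dm-worker
kubectl -n nnf-dm-system get pod <one-of-the-worker-pods> \
  -o jsonpath='{.spec.containers[?(@.name=="worker")].image}'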


bdevcich commented Sep 5, 2024

Additionally, the entire sshd_config is now configurable via the nnf-dm-worker-config ConfigMap in Kubernetes; users can increase the value of MaxStartups by editing that ConfigMap: https://nearnodeflash.github.io/dev/guides/data-movement/readme/#sshd-configuration-for-data-movement-workers

NearNodeFlash/nnf-dm#208
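For example, a minimal way to make that edit (the namespace is an assumption about the deployment; adjust to wherever the nnf-dm-worker pods run):

kubectl -n nnf-dm-system edit configmap nnf-dm-worker-config
# bump MaxStartups in the embedded sshd_config; the nnf-dm-worker pods may
# need a restart to pick up the new value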

@bdevcich bdevcich closed this as completed Sep 5, 2024
@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Closed in Issues Dashboard Sep 5, 2024