Increase MaxConnections and MaxStartups in sshd config #21
When more than 10 concurrent data movement requests come in for a particular
rabbit, the default sshd configuration (MaxStartups 10:30:100) starts
randomly dropping 30% of new unauthenticated connections, ramping up to
dropping all of them at 100 connections. This causes data movement requests
to fail, since any concurrency over 10 causes ssh to close connections
(from mpirun).
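To make the 10:30:100 behavior concrete, here is a minimal Python sketch of how the drop probability ramps. The formula follows the documented semantics of `MaxStartups start:rate:full`; the function name and structure are illustrative, not from the PR:

```python
# Sketch of sshd's MaxStartups behavior: below `begin` unauthenticated
# connections nothing is dropped; at `begin` connections new ones are
# dropped with `rate`% probability, rising linearly to 100% at `full`.

def drop_probability(unauth_connections, begin=10, rate=30, full=100):
    """Percent chance sshd refuses a new unauthenticated connection."""
    if unauth_connections < begin:
        return 0.0      # below the threshold: never dropped
    if unauth_connections >= full:
        return 100.0    # at/above "full": always dropped
    # Linear ramp from `rate`% at `begin` to 100% at `full`.
    span = full - begin
    return rate + (unauth_connections - begin) / span * (100 - rate)

if __name__ == "__main__":
    for n in (5, 10, 55, 100):
        print(n, drop_probability(n))
```

With the defaults, 10 concurrent unauthenticated connections already gives a 30% drop chance, which matches the failures seen from mpirun.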
This change increases that value to be able to handle the max theoretical
load for a particular rabbit. This image runs as 1 pod per rabbit node
(i.e. nnf-dm-worker-*), and each rabbit node supports 16 compute nodes with
192 cores each. Each core on a compute node could be creating a data
movement request.
16 * 192 = 3072
Bump it up to a power of 2 for good measure -> 4096.
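A sketch of the corresponding sshd_config change (the exact directive values committed in this PR may differ; these just follow the 4096 figure above):

```
# /etc/ssh/sshd_config on the nnf-dm-worker pods
# Default is "MaxStartups 10:30:100"; raise the ceiling to the max
# theoretical load of 16 compute nodes * 192 cores = 3072, rounded to 4096.
MaxStartups 4096:30:4096
```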
This is a start. I believe this number will need to be much higher if a
Lustre job spans every node and every node tries to start DM requests via
SCR. But 4096 is much better than 10.
Signed-off-by: Blake Devcich [email protected]