Increase MaxConnections and MaxStartups in sshd config #21
When more than 10 concurrent data movement requests come in for a particular
rabbit, the default sshd configuration (MaxStartups 10:30:100) starts
randomly dropping 30% of new unauthenticated connections, ramping up to
dropping all of them at 100 connections. This causes data movement requests
to fail, since any concurrency over 10 causes ssh to close connections
(from mpirun).
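To make the 10:30:100 behavior concrete, here is a minimal Python sketch of how the drop probability ramps. The formula follows the documented semantics of `MaxStartups start:rate:full`; the function name and structure are illustrative, not from the PR:

```python
# Sketch of sshd's MaxStartups behavior: below `begin` unauthenticated
# connections nothing is dropped; at `begin` connections new ones are
# dropped with `rate`% probability, rising linearly to 100% at `full`.

def drop_probability(unauth_connections, begin=10, rate=30, full=100):
    """Percent chance sshd refuses a new unauthenticated connection."""
    if unauth_connections < begin:
        return 0.0      # below the threshold: never dropped
    if unauth_connections >= full:
        return 100.0    # at/above "full": always dropped
    # Linear ramp from `rate`% at `begin` to 100% at `full`.
    span = full - begin
    return rate + (unauth_connections - begin) / span * (100 - rate)

if __name__ == "__main__":
    for n in (5, 10, 55, 100):
        print(n, drop_probability(n))
```

With the defaults, 10 concurrent unauthenticated connections already gives a 30% drop chance, which matches the failures seen from mpirun.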
This change increases that value to be able to handle the max theoretical
load for a particular rabbit. This image runs as 1 pod per rabbit node
(i.e. nnf-dm-worker-*), and each rabbit node supports 16 compute nodes with
192 cores each. Each core on a compute node could be creating a data
movement request.
16 * 192 = 3072
Bump it up to a power of 2 for good measure -> 4096.
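A sketch of the corresponding sshd_config change (the exact directive values committed in this PR may differ; these just follow the 4096 figure above):

```
# /etc/ssh/sshd_config on the nnf-dm-worker pods
# Default is "MaxStartups 10:30:100"; raise the ceiling to the max
# theoretical load of 16 compute nodes * 192 cores = 3072, rounded to 4096.
MaxStartups 4096:30:4096
```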
This is a start. I believe this number will need to be much higher if a
Lustre job spans every node and every node tries to start DM requests via
SCR. But 4096 is much better than 10.
Signed-off-by: Blake Devcich [email protected]