Copy Offload requests are failing as we scale the number of using processes and compute nodes #196
From @bdevcich on 8/7:

We've been able to reproduce this on our end just by using YAML to create the data movement requests, so it has nothing to do with the copy offload API itself. Going one step further, we can also reproduce this by skipping data movement altogether and using ssh directly.
If we run this using a hostfile that contains only localhost (as is the case for gfs2/xfs data movement), we don't see this problem, presumably because ssh is not involved.
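For context, here is a sketch of the two hostfile shapes involved (the remote node names are hypothetical). When every entry is localhost, the launcher can fork processes locally; any remote hostname forces each launch through an ssh connection:

```
# gfs2/xfs data movement: all entries local, no ssh involved
localhost
localhost

# multi-node data movement: remote entries, one ssh connection per launch
rabbit-node-1
rabbit-node-2
```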
This appears to be sshd's MaxStartups setting, which defaults to 10:30:100. From what I understand, this means that after 10 unauthenticated connections, sshd starts dropping 30% of new connections, and after 100 unauthenticated connections it drops every new one. Each ssh connection counts as unauthenticated until authentication completes, so even though we're using ssh keys, this still applies. Bumping the value up resolves the issue.

To fix this, we need to change the nnf-mfu image used by the nnf-dm-worker pods. The images used by the nnf-dm-workers should be changeable using gitops, so this fix might not require any code change. That said, I do have some improvements coming (like being able to change the command used for stat).
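Assuming the throttling described above is OpenSSH sshd's MaxStartups mechanism, the relevant sshd_config lines would look like the sketch below. The raised value is illustrative only, not the value actually chosen for the fix:

```
# Default: once 10 unauthenticated connections are pending, drop 30% of
# new ones at random; once 100 are pending, drop all new connections.
#MaxStartups 10:30:100

# Illustrative bump: allow many more pending unauthenticated connections
# before any are dropped (example value only).
MaxStartups 1024
```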
The fix to increase the number of ssh connections is here:

The ability to alter the stat command is here:
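Assuming the ssh throttling above is sshd's MaxStartups (default 10:30:100), the drop behavior can be sketched numerically: the drop chance ramps linearly from the rate at the begin threshold up to 100% at the full threshold, which is why failures grow with the number of processes and nodes. The function name and integer-percent arithmetic here are illustrative, not sshd source code:

```python
def drop_probability(pending, begin=10, rate=30, full=100):
    """Approximate percent chance sshd drops a new connection, given
    `pending` unauthenticated connections and MaxStartups begin:rate:full."""
    if pending < begin:
        return 0  # below the threshold: never drop
    if pending >= full:
        return 100  # at or past full: drop every new connection
    # Linear ramp from `rate`% at `begin` to 100% at `full`.
    return rate + (100 - rate) * (pending - begin) // (full - begin)
```

With the defaults, a 4-node/16-process allocation that opens a few dozen near-simultaneous ssh connections quickly lands in the ramp region, so a fraction of launches fail intermittently.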
I've updated the
Once changed, this will cause the

Additionally, the entire
We are seeing the following failure from the nnf-dm API when running a 4-node/16-process allocation:
(Further detailed discussion here: https://llnl.slack.com/archives/C020U81E05U/p1722894207606109 and here: https://llnl.slack.com/archives/C020U81E05U/p1722967213939429)
Shortly after starting the test, we see the following failure when the copy offload has started: