NNFDM - SCR/AXL Receiving a cannot-stat error for file that should exist #189
Starting information for this failure was:
|
Ending information, including the error was:
|
It should be noted that […]. Are these files removed by the nnfdm after the copy completes? |
This looks like lustre to lustre data movement, yes?
Files are not removed until the workflow is torn down and the filesystem is deleted. nnf-dm is trying to stat the file to understand if it's a file or a directory so it can make decisions on how/if it needs to create the proper destination directory (to make sure it exists). When doing lustre-lustre data movement, that stat is performed via mpirun so that it can run on the nnf-dm-worker pods which are running on the rabbit-nodes (mpirun is also used for xfs/gfs, but mpirun targets localhost since it runs on the worker pods). Based on the error message, it looks like mpirun was not able to hit the nnf-dm-worker pod from the nnf-dm-controller-manager pod:
When this happens, are you able to confirm that there is a nnf-dm-worker pod at the IP address listed above? You can do that by querying the pods and looking at their IPs:
It's possible that the pod is in a non-Running state or there are network issues at play here. |
Yes, this is a lustre to lustre data movement. The nnf-dm-worker pods all seem to be present:
|
What seems strange to me is that I am noticing this error message when the following are true:
Additionally, it appears that the file was indeed copied to the destination directory. |
I have been able to reproduce this with
This command tells
|
This is the error message that is received and it's an mpirun issue:
We've been able to reproduce this on elcap with @mcfadden8's handy reproducer:
This test application is running 48 processes across 12 nodes. This results in 47 DataMovements. What is happening is that mpirun is failing to talk to the nnf-dm-worker pods. |
Out of the 47 DataMovements, a handful of these hit the issue. It seems to be about 3-5 of them from what I've seen. |
The output for this error suggests setting an MCA parameter.
This seems to be a workaround when combined with SCR setting requests with slot=1, max-slots=-1. This effectively translates to using slot=1 in the mpirun hostfile and not setting max-slots at all, because the server-side setting is to not use max-slots. Unfortunately, we don't have a way to make this configurable. To set the parameter, we need to go into the deployment and set it there.
|
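For reference, Open MPI MCA parameters can be injected via environment variables of the form `OMPI_MCA_<name>`, which is one way to set the parameter inside the pods if the deployment can be edited. This is a hypothetical fragment; the container spec location, parameter name, and value below are placeholders, not taken from this thread:

```yaml
# Hypothetical deployment fragment: inject an MCA parameter into the
# data-movement container via Open MPI's OMPI_MCA_<name> env-var convention.
# <parameter-name> and <value> are placeholders for the parameter suggested
# by the mpirun error output above.
env:
  - name: OMPI_MCA_<parameter-name>
    value: "<value>"
```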
I am able to reproduce this on our end. I will get back to this when I return to the office on Monday, Aug 19. |
This (I believe) is a dupe of #196. Closing. |
This seems to be a timing-related issue as it does not always happen. But it is happening with a 4-node allocation with 4 processes per node (16 processes total).
The following line of code succeeds:

```cpp
nnfdm::RPCStatus rpc_status{ nnfdm_client->Status(*nnfdm_workflow, nnfdm::StatusRequest{std::string{uid}, max_seconds_to_wait}, &status_response) };
```

`rpc_status.ok()` is true and the `status_response.state()` is set to `STATE_COMPLETED`. However, the `status_response.status()` is not set to `nnfdm::StatusResponse::Status::STATUS_SUCCESS`, and the decoded response is logged below.