
NNFDM - SCR/AXL Receiving a cannot-stat error for file that should exist #189

Closed
mcfadden8 opened this issue Jul 31, 2024 · 12 comments

@mcfadden8

This seems to be a timing-related issue, as it does not always happen. But it is happening with a 4-node, 4-processes-per-node allocation (16 processes total).

The following line of code succeeds:

```
nnfdm::RPCStatus rpc_status{ nnfdm_client->Status(*nnfdm_workflow, nnfdm::StatusRequest{std::string{uid}, max_seconds_to_wait}, &status_response) };
```

rpc_status.ok() is true and status_response.state() is set to STATE_COMPLETED. However, status_response.status() is not set to nnfdm::StatusResponse::Status::STATUS_SUCCESS; the check is sketched below, and the decoded response is logged after it.
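
This is only a minimal sketch of the check as I understand it, assembled from the call above and the enum names quoted in this thread; it is not the verbatim AXL source. rpc_status and status_response are the variables from that call.

```
// Minimal sketch (not the verbatim AXL source); rpc_status and
// status_response come from the Status() call shown above.
if (!rpc_status.ok()) {
    // RPC/transport-level failure -- not the case here, since rpc_status.ok() is true.
} else if (status_response.state() == nnfdm::StatusResponse::State::STATE_COMPLETED &&
           status_response.status() != nnfdm::StatusResponse::Status::STATUS_SUCCESS) {
    // The branch being hit: the request reaches STATE_COMPLETED, but status()
    // decodes to STATUS_INVALID (0) rather than STATUS_SUCCESS, so AXL reports
    // the offload as unsuccessful and logs the decoded response below.
}
```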

AXL rzadams1028: @ nnfdm_start:173  Request(src=/mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0000-00000.silo, dst=/p/lustre1/martymcf/BDH/lustre-scr_sync_storage/4/xxxx00000/xxxx-0000-00000.silo)
AXL rzadams1028: @ nnfdm_start:173  Request(src=/mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0003-00000.silo, dst=/p/lustre1/martymcf/BDH/lustre-scr_sync_storage/4/xxxx00000/xxxx-0003-00000.silo)
AXL rzadams1028: @ nnfdm_start:173  Request(src=/mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0002-00000.silo, dst=/p/lustre1/martymcf/BDH/lustre-scr_sync_storage/4/xxxx00000/xxxx-0002-00000.silo)
AXL rzadams1028: @ nnfdm_start:173  Request(src=/mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0001-00000.silo, dst=/p/lustre1/martymcf/BDH/lustre-scr_sync_storage/4/xxxx00000/xxxx-0001-00000.silo)
AXL rzadams1058: @ nnfdm_start:173  Request(src=/mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0012-00000.silo, dst=/p/lustre1/martymcf/BDH/lustre-scr_sync_storage/4/xxxx00000/xxxx-0012-00000.silo)
AXL rzadams1058: @ nnfdm_start:173  Request(src=/mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0013-00000.silo, dst=/p/lustre1/martymcf/BDH/lustre-scr_sync_storage/4/xxxx00000/xxxx-0013-00000.silo)
AXL rzadams1051: @ nnfdm_start:173  Request(src=/mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0008-00000.silo, dst=/p/lustre1/martymcf/BDH/lustre-scr_sync_storage/4/xxxx00000/xxxx-0008-00000.silo)
AXL rzadams1050: @ nnfdm_start:173  Request(src=/mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0005-00000.silo, dst=/p/lustre1/martymcf/BDH/lustre-scr_sync_storage/4/xxxx00000/xxxx-0005-00000.silo)
AXL rzadams1051: @ nnfdm_start:173  Request(src=/mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0011-00000.silo, dst=/p/lustre1/martymcf/BDH/lustre-scr_sync_storage/4/xxxx00000/xxxx-0011-00000.silo)
AXL rzadams1050: @ nnfdm_start:173  Request(src=/mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0004-00000.silo, dst=/p/lustre1/martymcf/BDH/lustre-scr_sync_storage/4/xxxx00000/xxxx-0004-00000.silo)
AXL rzadams1051: @ nnfdm_start:173  Request(src=/mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0010-00000.silo, dst=/p/lustre1/martymcf/BDH/lustre-scr_sync_storage/4/xxxx00000/xxxx-0010-00000.silo)
AXL rzadams1050: @ nnfdm_start:173  Request(src=/mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0007-00000.silo, dst=/p/lustre1/martymcf/BDH/lustre-scr_sync_storage/4/xxxx00000/xxxx-0007-00000.silo)
AXL rzadams1051: @ nnfdm_start:173  Request(src=/mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0009-00000.silo, dst=/p/lustre1/martymcf/BDH/lustre-scr_sync_storage/4/xxxx00000/xxxx-0009-00000.silo)
AXL rzadams1050: @ nnfdm_start:173  Request(src=/mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0006-00000.silo, dst=/p/lustre1/martymcf/BDH/lustre-scr_sync_storage/4/xxxx00000/xxxx-0006-00000.silo)
AXL rzadams1058: @ nnfdm_start:173  Request(src=/mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0015-00000.silo, dst=/p/lustre1/martymcf/BDH/lustre-scr_sync_storage/4/xxxx00000/xxxx-0015-00000.silo)
AXL rzadams1058: @ nnfdm_start:173  Request(src=/mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0014-00000.silo, dst=/p/lustre1/martymcf/BDH/lustre-scr_sync_storage/4/xxxx00000/xxxx-0014-00000.silo)
AXL rzadams1028: @ nnfdm_start:173  Request(src=/mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx00000.root, dst=/p/lustre1/martymcf/BDH/lustre-scr_sync_storage/4/xxxx00000.root)
AXL 0.8.0 ERROR: rzadams1028: NNFDM Offload Status UNSUCCESSFUL: 0
Offload Command Status: {3/STATE_COMPLETED}, {0/STATUS_INVALID}
    Offload Command Status:
      Command: 
      Progress: 0%
      ElapsedTime: 
      LastMessage: 
      LastMessageTime: 
    Offload StartTime: 
    Offload EndTime: 
    Offload Message: internal error: could not determine source type: could not stat path ('/mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0003-00000.silo'): command: mpirun --allow-run-as-root -n 1 --hostfile /tmp/nnf-dm-7vwv8/hostfile -- setpriv --euid 54987 --egid 54987 --clear-groups stat --cached never -c '%F' /mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0003-00000.silo - stderr: Warning: Permanently added '[10-85-250-25.dm.nnf-dm-system]:2222,[10.85.250.25]:2222' (ECDSA) to the list of known hosts.
kex_exchange_identification: Connection closed by remote host
Connection closed by 10.85.115.161 port 2222
@mcfadden8
Author

Starting information for this failure was:

flux run -vvvv '--requires=-host:rzadams[1033-1048]' -q pdev -t 1h -N4 -n16 --setattr=dw="#DW jobdw type=lustre capacity=640GiB name=scrcache requires=copy-offload" ./ares.opt -scr_sync_storage -cwd /p/lustre1/martymcf/BDH/lustre-scr_sync_storage/4 -set cstop=400 -def NZ=646.07222815105718621654 -r ./ForcedTurbDriver
jobid: fbLcahHMJD5
0.000s: job.submit {"userid":54987,"urgency":16,"flags":0,"version":1}
0.018s: job.validate
0.030s: job.dependency-add {"description":"dws-create"}
0.050s: job.memo {"rabbit_workflow":"fluxjob-255052157852582912"}
3.036s: job.dependency-remove {"description":"dws-create"}
3.036s: job.depend
3.036s: job.priority {"priority":16}
3.051s: job.alloc {"annotations":{"user":{"rabbit_workflow":"fluxjob-255052157852582912"}}}
3.051s: job.prolog-start {"description":"cray-pals-port-distributor"}
3.051s: job.prolog-start {"description":"dws-setup"}
3.051s: job.prolog-start {"description":"job-manager.prolog"}
3.051s: job.cray_port_distribution {"ports":[11956,11957],"random_integer":4670914926262541282}
3.051s: job.prolog-finish {"description":"cray-pals-port-distributor","status":0}
4.027s: job.memo {"rabbits":"rzadams[202,204]"}
38.024s: job.dws_environment {"variables":{"DW_JOB_scrcache":"/mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0","DW_WORKFLOW_NAME":"fluxjob-255052157852582912","DW_WORKFLOW_NAMESPACE":"default"},"rabbits":{"rzadams202":"rzadams[1017-1032]","rzadams204":"rzadams[1049-1064]"},"copy_offload":true}
38.024s: job.prolog-finish {"description":"dws-setup","status":0}
39.132s: job.prolog-finish {"description":"job-manager.prolog","status":0}
39.149s: job.start
39.133s: exec.init
39.137s: exec.starting
39.472s: exec.shell.init {"service":"54987-shell-fbLcahHMJD5","leader-rank":28,"size":4}
39.587s: exec.shell.start {"taskmap":{"version":1,"map":[[0,4,4,1]]}}

@mcfadden8
Author

Ending information, including the error, was:

AXL 0.8.0 ERROR: rzadams1028: NNFDM Offload Status UNSUCCESSFUL: 0
Offload Command Status: {3/STATE_COMPLETED}, {0/STATUS_INVALID}
    Offload Command Status:
      Command: 
      Progress: 0%
      ElapsedTime: 
      LastMessage: 
      LastMessageTime: 
    Offload StartTime: 
    Offload EndTime: 
    Offload Message: internal error: could not determine source type: could not stat path ('/mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0003-00000.silo'): command: mpirun --allow-run-as-root -n 1 --hostfile /tmp/nnf-dm-7vwv8/hostfile -- setpriv --euid 54987 --egid 54987 --clear-groups stat --cached never -c '%F' /mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0003-00000.silo - stderr: Warning: Permanently added '[10-85-250-25.dm.nnf-dm-system]:2222,[10.85.250.25]:2222' (ECDSA) to the list of known hosts.
kex_exchange_identification: Connection closed by remote host
Connection closed by 10.85.115.161 port 2222
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   nnf-dm-controller-manager-5779768c56-mtdbp
  target node:  10-85-250-25.dm.nnf-dm-system

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
 - stdout:  - error: exit status 255

78.387s: exec.shell.task-exit {"localid":3,"rank":3,"state":"Exited","pid":2972751,"wait_status":65280,"signaled":0,"exitcode":255}
AXL rzadams1058: @ nnfdm_stat:87  Offload Complete(/mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0015-00000.silo)
108.396s: flux-shell[0]: FATAL: doom: rank 3 on host rzadams1028 exited and exit-timeout=30s has expired
108.398s: job.exception type=exec severity=0 rank 3 on host rzadams1028 exited and exit-timeout=30s has expired
108.398s: job.epilog-start {"description":"dws-epilog"}
108.694s: exec.complete {"status":36608}
108.694s: exec.done
flux-job: task(s) Terminated
108.694s: job.finish {"status":36608}
FAIL
run_ares_BDHVDTD -F lustre -N 4 -T 4 -N 4 -c 400 -S 40 2>&1  0.13s user 3.04s system 2% cpu 1:57.27 total
tee logfile  0.00s user 0.01s system 0% cpu 1:57.27 total

@mcfadden8
Author

It should be noted that /mnt/nnf/5fcb20a4-f5de-411e-8c0a-1c81e6e388bf-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0003-00000.silo appears to have been successfully copied to /p/lustre1/martymcf/BDH/lustre-scr_sync_storage/4/xxxx00000/xxxx-0003-00000.silo (I confirmed that the file exists there).

Are these files removed by nnfdm after the copy completes?

@bdevcich
Contributor

bdevcich commented Aug 1, 2024

This looks like lustre to lustre data movement, yes?

Are these files removed by nnfdm after the copy completes?

Files are not removed until the workflow is torn down and the filesystem is deleted.

nnf-dm is trying to stat the file to determine whether it's a file or a directory, so it can decide how (and whether) it needs to create the proper destination directory (to make sure it exists). When doing lustre-to-lustre data movement, that stat is performed via mpirun so that it can run on the nnf-dm-worker pods, which run on the rabbit nodes (mpirun is also used for xfs/gfs, but there it targets localhost since it already runs on the worker pods).

Based on the error message, it looks like mpirun was not able to hit the nnf-dm-worker pod from the nnf-dm-controller-manager pod:

ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   nnf-dm-controller-manager-5779768c56-mtdbp
  target node:  10-85-250-25.dm.nnf-dm-system

When this happens, are you able to confirm that there is a nnf-dm-worker pod at the IP address listed above? You can do that by querying the pods and looking at their IPs:

kubectl get pods -n nnf-dm-system -o wide | grep nnf-dm-worker
nnf-dm-worker-k6hzh                                  2/2     Running   0          18h   10.85.250.25    rabbit-node-2       <none>           <none>
nnf-dm-worker-v2dpc                                  2/2     Running   0          18h   10.85.251.38   rabbit-node-1       <none>           <none>

It's possible that the pod is in a non-Running state or there are network issues at play here.

@mcfadden8
Author

Yes, this is a lustre to lustre data movement.

The nnf-dm-worker pods all seem to be present:

[  8:31AM ]  [ martymcf@rzadams1001:~/source/scr/scr-mcfadden8_add-dm-client/axl(mcfadden8/add-dm-client✗) ]
kubectl get pods -n nnf-dm-system -o wide | grep nnf-dm-worker
nnf-dm-worker-5ndkr                                  2/2     Running   2          8d    10.85.213.88    rzadams207   <none>           <none>
nnf-dm-worker-67f9d                                  2/2     Running   2          8d    10.85.250.25    rzadams204   <none>           <none>
nnf-dm-worker-khwtj                                  2/2     Running   2          8d    10.85.146.87    rzadams206   <none>           <none>
nnf-dm-worker-lv6nd                                  2/2     Running   2          8d    10.85.115.161   rzadams202   <none>           <none>
nnf-dm-worker-qp6wp                                  2/2     Running   2          8d    10.85.55.151    rzadams208   <none>           <none>
nnf-dm-worker-sfwc4                                  2/2     Running   2          8d    10.85.232.28    rzadams205   <none>           <none>
nnf-dm-worker-wwqpf                                  2/2     Running   2          8d    10.85.52.94     rzadams201   <none>           <none>

@mcfadden8
Author

What seems strange to me is that I am noticing this error message when the following are true:

  1. nnfdm::RPCStatus::ok() == true
  2. nnfdm::StatusResponse::state() == nnfdm::StatusResponse::State::STATE_COMPLETED
  3. nnfdm::StatusResponse::status() != nnfdm::StatusResponse::Status::STATUS_SUCCESS (was set to 0)

Additionally, it appears that the file was indeed copied to the destination directory.

@mcfadden8
Author

mcfadden8 commented Aug 5, 2024

I have been able to reproduce this with test_api, which is a simple test program for SCR. The problem has been reproduced on rzadams with the following command line:

flux run '--requires=-host:rzadams[1033-1048]' -q pdev -t 1h -N4 -n16 --setattr=dw="#DW jobdw type=lustre capacity=1440GiB name=scrcache requires=copy-offload" ./test_api -t 5 -s 9GB

This command tells test_api to create a sequence of 5 checkpoint files per process for 16 processes across 4 compute nodes.
The following error was produced by this command:

flux run '--requires=-host:rzadams[1033-1048]' -q pdev -t 1h -N4 -n16 --setattr=dw="#DW jobdw type=lustre capacity=1440GiB name=scrcache requires=copy-offload" ./test_api -t 5 -s 9GB
flux-job: fcJbo2NAK3m started                                                                                                                                                                                                                                                                                                                                       00:00:41
2024-08-05 14:37:22:937 AXL ERROR:1971777 rzadams1068: @ nnfdm_stat:96 NNFDM Offload Status UNSUCCESSFUL: 0
Offload Command Status: {3/STATE_COMPLETED}, {0/STATUS_INVALID}
    Offload Command Status:
      Command: 
      Progress: 0%
      ElapsedTime: 
      LastMessage: 
      LastMessageTime: 
    Offload StartTime: 
    Offload EndTime: 
    Offload Message: internal error: could not determine source type: could not stat path ('/mnt/nnf/df5e1995-c0f6-4b56-8e8b-a00d7a70807d-0/martymcf/scr.defjobid/scr.dataset.1/rank_9.ckpt'): command: mpirun --allow-run-as-root -n 1 --hostfile /tmp/nnf-dm-42p79/hostfile -- setpriv --euid 54987 --egid 54987 --clear-groups stat --cached never -c '%F' /mnt/nnf/df5e1995-c0f6-4b56-8e8b-a00d7a70807d-0/martymcf/scr.defjobid/scr.dataset.1/rank_9.ckpt - stderr: kex_exchange_identification: Connection closed by remote host
Connection closed by 10.85.232.28 port 2222

@bdevcich
Contributor

bdevcich commented Aug 7, 2024

This is the error message that is received and it's an mpirun issue:

--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   nnf-dm-controller-manager-5779768c56-mtdbp
  target node:  10-85-250-25.dm.nnf-dm-system

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------

We've been able to reproduce this on elcap with @mcfadden8's handy reproducer:

sudo su
su - martymcf
~martymcf/scr.repro/repro

repro is running the following flux job:

flux run -vvvv -q iotesting -t 1h -N12 -n48 --setattr=dw="#DW jobdw type=lustre capacity=4TiB name=scrcache requires=copy-offload"

This test application is running 48 processes across 12 nodes. This results in 47 NnfDataMovement requests (I'm assuming rank 0 doesn't create one?) being created at the same time. Since this is lustre to lustre data movement, the nnf-dm-controller-manager pod starts working on all 47 of these. The first thing each data movement request is attempting to do is to stat the source file to determine its file type. This happens through mpirun.

The nnf-dm-controller-manager runs on k8s nodes and targets the appropriate nnf-dm-worker pod running on the appropriate rabbit node. So it's launching mpirun to stat a file on the rabbit node. Depending on how flux schedules the job, this could be one or several rabbit nodes. So we have 1 launcher running 47 mpirun jobs at the same time and trying to target 1, 2, or 3 (or more) rabbit nodes.

What is happening is that mpirun is failing to talk to the nnf-dm-worker pod on the rabbit node, and we get the resulting error message. I suspect that this is a concurrency problem. This may be related to mpi configuration, slots, max slots, etc.

@bdevcich
Contributor

bdevcich commented Aug 7, 2024

Out of the 47 DataMovements, only a handful hit the issue; it seems to be about 3-5 of them from what I've seen.

@bdevcich
Contributor

bdevcich commented Aug 7, 2024

The output for this error suggests setting the MCA parameter routed=direct:

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.

This seems to be a workaround when combined with SCR setting requests with slot=1, max-slots=-1. This effectively translates to using slot=1 in the mpirun hostfile and not setting max-slots at all, because the server-side setting is to not use max-slots.

Unfortunately, we don't have a way to make routed=direct stick with the current software. This must be done manually.

To set the parameter, we need to go into the nnf-dm-controller-manager pod and set it manually. This will not survive a pod restart. I have an extra parameter in there that might be useful for debugging:

# Get the name of the `nnf-dm-controller-manager-*` pod
$ kubectl get pod -n nnf-dm-system | grep controller
nnf-dm-controller-manager-5779768c56-dxpwj           2/2     Running   2              11d
nnf-dm-manager-controller-manager-648959b444-jm5sq   2/2     Running   24 (40h ago)   11d

# Shell into it
kubectl exec -it -n nnf-dm-system nnf-dm-controller-manager-5779768c56-dxpwj -c manager -- /bin/bash

# Set env variables
export OMPI_MCA_routed=direct
export OMPI_MCA_plm_base_verbose=5

# Set the parameter in global param file (belt and suspenders)
mkdir -p /root/.openmpi
echo -e "routed = direct\nplm_base_verbose = 5" > ~/.openmpi/mca-params.conf
cat /root/.openmpi/mca-params.conf

# Verify the current value is set to `"direct"` and not `""`
ompi_info --param all all --level 9 | grep -E "routed|plm_base_verbose" | grep "current value"

@bdevcich
Contributor

bdevcich commented Aug 7, 2024

I am able to reproduce this on our end. I will get back to this when I return to the office on Monday, Aug 19.

@bdevcich
Contributor

This (I believe) is a dupe of #196. Closing.

github-project-automation bot moved this from 📋 Open to ✅ Closed in Issues Dashboard on Aug 23, 2024