Removing PRRTE daemons #2128

Closed
Petter-Programs opened this issue Jan 30, 2025 · 6 comments

Comments

@Petter-Programs

Background information

What version of the PMIx Reference RTE (PRRTE) are you using? (e.g., v2.0, v3.0, git master @ hash, etc.)

commit 43dce07ee590d2275c6d0eaecad4f13662eb5a04 (HEAD -> master, origin/master, origin/HEAD)
Author: Ralph Castain <[email protected]>
Date:   Tue Jan 28 17:31:04 2025 -0700

Compiled with --enable-prte-ft flag

What version of PMIx are you using? (e.g., v4.2.0, git branch name and hash, etc.)

commit e2a534c8698cbaccb0140663f467f128bcf8fe1f (HEAD -> master, origin/master, origin/HEAD)
Author: Ralph Castain <[email protected]>
Date:   Fri Jan 24 10:37:23 2025 -0700

Please describe the system on which you are running

  • Operating system/version: Red Hat Enterprise Server; Linux 5.14.x (EL9-based distribution)
  • Computer hardware: Intel Sapphire Rapids 8480
  • Network type: NDR200

Details of the problem

Hello,

I am wondering to what extent it is possible to remove PRRTE daemons that have already been added to the DVM's allocation. Here's an example program ("say_hello.c") that just says hello 120 times, once per second:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SLEEP_TIME 120

int main(int argc, char *argv[])
{
    for (int i = 0; i < SLEEP_TIME; i++)
    {
        printf("Hello (%d/%d)...\n", i+1, SLEEP_TIME);
        sleep(1);
    }

    return EXIT_SUCCESS;
}

Launching the DVM:

prte --host host1:1,host2:1 --daemonize

Then running it with PRRTE on host1 only:

prun --pid <PID of DVM just started> -np 1 --host host1 ./say_hello

Now, say I want to shut down the daemon on host2 while say_hello is running on host1. I have tried and failed to do this without shutting down the whole DVM, which kills say_hello in the middle of execution (see below). So my question is: is it possible to remove daemons from a running DVM?

Hello (1/120)...
Hello (2/120)...
Hello (3/120)...
Hello (4/120)...
Hello (5/120)...
Hello (6/120)...
Hello (7/120)...
Hello (8/120)...
Hello (9/120)...
Hello (10/120)...
Hello (11/120)...
Hello (12/120)...
Hello (13/120)...
Hello (14/120)...
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prte-host1-419823@0,0] on node host1
  Remote daemon: [prte-host1-419823@0,1] on node host2

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

These are the things I have tried.

  • Compiling with the fault tolerance support mentioned here (--enable-prte-ft) and following the guide. It made no difference when killing the remote daemon.
  • Using pterm: this shuts down the whole DVM, killing the say_hello program.
  • Setting the PRRTE MCA parameters prte_max_recon_attempts (set to -1 to retry forever; the DVM still shut down immediately) and prte_allowed_exit_without_sync (set to 1; no effect).

@rhc54
Contributor

rhc54 commented Jan 30, 2025

That is correct - PRRTE currently does not support loss of a daemon. The DVM will shut down in that circumstance. There are some folks who have talked now-and-then about extending it, but there has been no progress made in that direction so far. Unsure if/when that may happen.

@rhc54 closed this as not planned on Jan 30, 2025
@hppritcha
Contributor

hppritcha commented Jan 30, 2025

Hmm... would you expect that using the --add-hosts option to prun with a hostfile which subtracts all the slots from one of the hosts currently in the DVM would result in termination of the daemon on the node being removed?

@rhc54
Contributor

rhc54 commented Jan 30, 2025

Offhand, I would say "no" - I wouldn't expect the daemon to terminate. However, I've never tried something like that and honestly have no idea what the code would do in that case. Is that what you are attempting? If so, then that's a different issue.

@Petter-Programs
Author

Hi,

Yes, this is what I am attempting, i.e. voluntarily shutting down a daemon.

I just had a go at testing the approach suggested by @hppritcha and I can confirm that it does not terminate the daemon.

@rhc54
Contributor

rhc54 commented Jan 30, 2025

Shutting down a daemon is very different from setting its available slots to zero. The latter simply removes it from mapping operations. Shutting it down kills any executing procs on that node, breaks the communication tree, etc.

@Petter-Programs
Author

I see, thanks for the clarification!
