Add support for dropped connections #64

Andrew-S-Rosen · 2023-05-12T13:50:00Z

What should we add?

If the server connection is halted, in-progress workflows remain "running" indefinitely. A nice feature would be to add some sort of support for dropped connections or server restarts.

From Will:

stopping or restarting the server does drop connections and cause in-progress workflows to be dropped. they'll just appear as "running" forever. the recommended course as of the latest release would be to redispatch using previous results. if the connection drops, same deal. at least within the slurm/ssh plugins it may be easy to add reconnect logic with some retries.

Tagging @utf who had the question/suggestion in the first place.

Describe alternatives you've considered.

You can redispatch if needed.

jackbaker1001 · 2023-06-01T03:28:31Z

@wjcunningham7 I think the issue we were discussing yesterday is either related to or is this one.

The plan is:

(1) Add two new constructor inputs, keepalive_interval and reconnect_retries (distinct from retries, which is triggered when a task fails).

(2) Add reconnect attempts in the polling loop.

Looking at lines 511-535 in ~/covalent_slurm_plugin/slurm.py we have

async def _poll_slurm(self, job_id: int, conn: asyncssh.SSHClientConnection) -> None:
    """Poll a Slurm job until completion.

    Args:
        job_id: Slurm job ID.
        conn: SSH connection object.

    Returns:
        None
    """

    # Poll status every `poll_freq` seconds
    status = await self.get_status({"job_id": str(job_id)}, conn)

    while (
        "PENDING" in status
        or "RUNNING" in status
        or "COMPLETING" in status
        or "CONFIGURING" in status
    ):
        await asyncio.sleep(self.poll_freq)
        status = await self.get_status({"job_id": str(job_id)}, conn)

    if "COMPLETED" not in status:
        raise RuntimeError("Job failed with status:\n", status)

I assume there is something we can take from conn to check whether it has gone stale. This should be checked every keepalive_interval, for a maximum of reconnect_retries times. If it has gone stale, we just re-run lines 201-333, which is the method:

async def _client_connect(self) -> asyncssh.SSHClientConnection:

jackbaker1001 · 2023-06-01T05:02:25Z

Further to the above, the thing to track is the output of conn.is_closing()
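To make the plan concrete, here is a rough sketch of how the reconnect check might slot into the polling loop. It is not the plugin's code. The function name poll_with_reconnect, the Connection protocol, and the get_status and connect callables are all hypothetical stand-ins for self.get_status and self._client_connect, and the sketch simply assumes the connection object exposes is_closing() as suggested above. For brevity it checks the connection once per poll instead of on a separate keepalive_interval timer.

```python
import asyncio
from typing import Awaitable, Callable, Protocol


class Connection(Protocol):
    """Hypothetical minimal view of the SSH connection (assumption for this sketch)."""

    def is_closing(self) -> bool: ...


# Slurm states that mean the job is still in flight (same list as _poll_slurm).
ACTIVE_STATES = ("PENDING", "RUNNING", "COMPLETING", "CONFIGURING")


async def poll_with_reconnect(
    get_status: Callable[[Connection], Awaitable[str]],  # stand-in for self.get_status
    connect: Callable[[], Awaitable[Connection]],  # stand-in for self._client_connect
    conn: Connection,
    poll_freq: float = 30.0,
    reconnect_retries: int = 3,
) -> str:
    """Poll until the job leaves the active states, reconnecting if the connection drops."""

    async def ensure_open(current: Connection) -> Connection:
        # Reuse the existing connection while it still looks healthy.
        if not current.is_closing():
            return current
        last_error = None
        # Otherwise try to open a fresh connection, up to reconnect_retries times.
        for _ in range(reconnect_retries):
            try:
                fresh = await connect()
            except OSError as err:
                last_error = err
                continue
            if not fresh.is_closing():
                return fresh
        raise ConnectionError(
            f"could not reconnect after {reconnect_retries} attempts"
        ) from last_error

    while True:
        conn = await ensure_open(conn)
        status = await get_status(conn)
        if not any(state in status for state in ACTIVE_STATES):
            break
        await asyncio.sleep(poll_freq)

    if "COMPLETED" not in status:
        raise RuntimeError(f"Job failed with status:\n{status}")
    return status
```

Passing the connect and status functions in as arguments keeps the sketch independent of the executor class. In the plugin itself the same logic would presumably live inside _poll_slurm and read reconnect_retries from the constructor.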

Andrew-S-Rosen added the feature New feature or functionality label May 12, 2023

Andrew-S-Rosen mentioned this issue May 12, 2023

Add support for dropped connections AgnostiqHQ/covalent-ssh-plugin#65

Closed

jackbaker1001 mentioned this issue Jun 8, 2023

Reconnect SSH after disconnect #71

Draft

3 tasks

Andrew-S-Rosen mentioned this issue Aug 17, 2023

Add support for dropped connections Quantum-Accelerators/covalent-hpc-plugin#26

Open
