Add support for dropped connections #64

Open
Andrew-S-Rosen opened this issue May 12, 2023 · 2 comments
Labels
feature New feature or functionality

Comments

@Andrew-S-Rosen
Contributor

Andrew-S-Rosen commented May 12, 2023

What should we add?

If the server connection is halted, in-progress workflows remain "running" indefinitely. A nice feature would be to add some sort of support for dropped connections or server restarts.

From Will:

stopping or restarting the server does drop connections and cause in-progress workflows to be dropped. they'll just appear as "running" forever. the recommended course as of the latest release would be to redispatch using previous results. if the connection drops, same deal. at least within the slurm/ssh plugins it may be easy to add reconnect logic with some retries.

Tagging @utf who had the question/suggestion in the first place.

Describe alternatives you've considered.

You can redispatch if needed.

@jackbaker1001

@wjcunningham7 I think the issue we were discussing yesterday is either related to or is this one.

The plan is:

(1) Add two new constructor inputs, keepalive_interval and reconnect_retries (distinct from retries, which is triggered when a task fails).

(2) Add reconnect attempts in the polling loop.

Looking at lines 511-535 in ~/covalent_slurm_plugin/slurm.py, we have:

async def _poll_slurm(self, job_id: int, conn: asyncssh.SSHClientConnection) -> None:
    """Poll a Slurm job until completion.

    Args:
        job_id: Slurm job ID.
        conn: SSH connection object.

    Returns:
        None
    """

    # Poll status every `poll_freq` seconds
    status = await self.get_status({"job_id": str(job_id)}, conn)

    while (
        "PENDING" in status
        or "RUNNING" in status
        or "COMPLETING" in status
        or "CONFIGURING" in status
    ):
        await asyncio.sleep(self.poll_freq)
        status = await self.get_status({"job_id": str(job_id)}, conn)

    if "COMPLETED" not in status:
        raise RuntimeError("Job failed with status:\n", status)

I assume conn exposes something we can use to check whether the connection has gone stale. This should be checked every keepalive_interval, allowing at most reconnect_retries reconnect attempts. If the connection has gone stale, we just re-run the method at lines 201-333:

async def _client_connect(self) -> asyncssh.SSHClientConnection:

@jackbaker1001

Further to the above, the thing to track is the output of conn.is_closing().
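
To make the plan concrete, here is a rough, library-agnostic sketch of what the polling loop could look like with reconnect attempts. It is not the plugin's code: poll_with_reconnect, the Connection protocol, and the get_status/connect callables are hypothetical stand-ins, and the only thing assumed about the connection object is that it exposes is_closing() as mentioned above. For simplicity the staleness check happens once per poll rather than on a separate keepalive_interval timer.

```python
import asyncio
from typing import Awaitable, Callable, Protocol


class Connection(Protocol):
    """Hypothetical stand-in for the SSH connection; only is_closing() is assumed."""

    def is_closing(self) -> bool: ...


# Slurm states that mean the job is still in flight (same list as _poll_slurm).
ACTIVE_STATES = ("PENDING", "RUNNING", "COMPLETING", "CONFIGURING")


async def poll_with_reconnect(
    get_status: Callable[[Connection], Awaitable[str]],
    connect: Callable[[], Awaitable[Connection]],
    conn: Connection,
    poll_freq: float = 60.0,
    reconnect_retries: int = 3,
) -> str:
    """Poll until the job leaves the active states, reconnecting if conn has closed."""
    reconnects_left = reconnect_retries
    while True:
        # Check for a stale connection before every status query.
        if conn.is_closing():
            if reconnects_left == 0:
                raise RuntimeError("Connection dropped and reconnect retries exhausted")
            reconnects_left -= 1
            conn = await connect()
            continue  # re-check the fresh connection before using it

        status = await get_status(conn)
        if not any(state in status for state in ACTIVE_STATES):
            break
        await asyncio.sleep(poll_freq)

    if "COMPLETED" not in status:
        raise RuntimeError(f"Job failed with status:\n{status}")
    return status
```

In the plugin, connect would presumably be self._client_connect and get_status a thin wrapper around self.get_status({"job_id": str(job_id)}, conn). Note that reconnect_retries here is a total budget for the whole polling run, not a per-drop count; either choice seems defensible.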
