self._remote_func_filename is not defined when a SLURM job hits the walltime #49

Open
Andrew-S-Rosen opened this issue Apr 24, 2023 · 0 comments

Environment

  • Covalent version: 0.209.1
  • Covalent-Slurm plugin version: Custom branch off of develop here so that I could log in. The code in question should not be impacted by this branch.
  • Python version: 3.9
  • Operating system: Linux

What is happening?

I tried submitting a SLURM job and got the following traceback.

Traceback (most recent call last):
  File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent/executor/base.py", line 452, in execute
    result = await self.run(function, args, kwargs, task_metadata)
  File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent_slurm_plugin/slurm.py", line 474, in run
    await self._poll_slurm(slurm_job_id, conn)
  File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent_slurm_plugin/slurm.py", line 333, in _poll_slurm
    raise RuntimeError("Job failed with status:\n", status)
RuntimeError: ('Job failed with status:\n', '')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent_dispatcher/_core/runner.py", line 293, in _run_task
    output, stdout, stderr, exception_raised = await executor._execute(
  File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent/executor/base.py", line 421, in _execute
    return await self.execute(
  File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent/executor/base.py", line 459, in execute
    await self.teardown(task_metadata=task_metadata)
  File "/home/arosen/anaconda3/envs/covalent/lib/python3.9/site-packages/covalent_slurm_plugin/slurm.py", line 505, in teardown
    remote_func_filename=self._remote_func_filename,
AttributeError: 'SlurmExecutor' object has no attribute '_remote_func_filename'

My guess is that self._remote_func_filename is never assigned because the RuntimeError is raised in run() first, so teardown() then fails when it tries to read the attribute.
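To make that guess concrete, here is a minimal sketch of the suspected failure mode that does not use Covalent at all. `ToyExecutor`, its methods, and the file name `func.pkl` are hypothetical stand-ins, and the ordering inside `run()` (polling first, attribute assignment afterwards) is an assumption rather than something confirmed from the plugin source.

```python
import asyncio


class ToyExecutor:
    """Hypothetical stand-in for an executor; this is NOT the Covalent API."""

    async def _poll(self):
        # Simulate a job that fails while being polled (e.g. walltime reached).
        raise RuntimeError("Job failed with status:\n", "")

    async def run(self):
        await self._poll()
        # Assumed ordering: the attribute is only set after polling succeeds,
        # so this line is never reached when _poll() raises.
        self._remote_func_filename = "func.pkl"

    async def teardown(self):
        # Reads an attribute that only a successful run() would have set.
        return self._remote_func_filename

    async def execute(self):
        try:
            await self.run()
        except RuntimeError:
            # Cleanup is attempted while the first error is being handled.
            await self.teardown()


def reproduce():
    """Return the exception that escapes execute(), or None if there is none."""
    try:
        asyncio.run(ToyExecutor().execute())
    except AttributeError as err:
        return err
    return None


error = reproduce()
print(type(error).__name__, error)
```

If the real executor follows the same pattern, the AttributeError raised during cleanup is the error that surfaces, with the original RuntimeError only shown as context above it in the traceback.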

How can we reproduce the issue?

import covalent as ct
import time

executor = ct.executor.SlurmExecutor(<redacted>)

@ct.lattice
@ct.electron(executor=executor)
def add(val1, val2):
    time.sleep(10000)  # must exceed the job's SLURM walltime so the job is killed
    return val1 + val2

dispatch_id = ct.dispatch(add)(1, 2)
result = ct.get_result(dispatch_id, wait=True)
print(result)

What should happen?

The covalent task should abort gracefully.

Any suggestions?

I think this error happens any time the job dies unexpectedly (e.g., it hits the walltime or fails for another reason). In those cases the task does not terminate gracefully.
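One possible direction, sketched here without any Covalent code, is to define the attribute up front and have the cleanup step tolerate it being unset. `GuardedExecutor`, its methods, and the `cleaned` attribute are hypothetical names used only for illustration, and the "TIMEOUT" status string is just an example value. This is not a proposed patch to the plugin.

```python
import asyncio


class GuardedExecutor:
    """Hypothetical stand-in for an executor; this is NOT the Covalent API."""

    def __init__(self):
        # Define the attribute up front so teardown() can always read it.
        self._remote_func_filename = None
        self.cleaned = None

    async def run(self):
        # Simulate a job that dies before any remote file name is recorded.
        raise RuntimeError("Job failed with status: TIMEOUT")

    async def teardown(self):
        if self._remote_func_filename is None:
            return []  # nothing was staged, so there is nothing to clean up
        return [self._remote_func_filename]

    async def execute(self):
        try:
            await self.run()
        finally:
            # Cleanup always runs and no longer raises a second exception.
            self.cleaned = await self.teardown()


def run_and_capture():
    """Run the executor and return it together with any RuntimeError raised."""
    executor = GuardedExecutor()
    try:
        asyncio.run(executor.execute())
    except RuntimeError as err:
        return executor, err
    return executor, None


executor, error = run_and_capture()
print(type(error).__name__, executor.cleaned)
```

With this arrangement the original RuntimeError is the error that reaches the caller, which is closer to what "abort gracefully" would look like.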

Addendum

It seems that adding the parsable: "" option fixes the empty returned status, but the same AttributeError still arises.

Andrew-S-Rosen changed the title from "self._remote_func_filename is not defined when a SLURM job doesn't complete" to "self._remote_func_filename is not defined when a SLURM job hits the walltime" on Jun 2, 2023