job submitted in login node not being cancelled if all rules fail? #127
Comments
This is absolutely not intended behaviour. However, Snakemake will keep running until the last job has finished (well or otherwise) to enable a proper restart without losing too many resources. Cancelling the workflow itself will cancel all jobs. Cancelling or finishing will take time, though: Snakemake needs to check files locally, and we intentionally do not check job status every second (more configurability will be implemented soon-ish). That being said: can you please run your workflow with --verbose?
I attached the log for the job in the login node and the submitted job that fails. Adding --verbose did not change what is captured in the logs.
Interesting workflow. If it is not already there, I hope you include it in the workflow catalog. Anyway, I meant that a workflow execution with failing jobs with …
I've been using the following to run my workflow: … I also added the job script itself in case it helps shed some light.
Ah, running within a job context might cause some issues. I have an idea to work on it, but could you try running your snakemake command on the login node?
Just did that and the result is exactly the same: the log output does not change at all and I have to manually terminate the snakemake run. One thing I noticed is that in ALL of my snakemake runs the job is submitted with the same ID according to Snakemake; it is ALWAYS jobid 128. But in reality all jobs have different job IDs, as I would expect.
That is another strange observation. Note that the job name is constant, as it is used to keep track of the jobs within one workflow run; when another workflow is started, it is different. The SLURM job IDs, however, ought to be incremented by SLURM itself. If this is not the case, please run:
$ sinfo --version
$ sacct -u $USER
That the job(s) are failing is one issue, explainable by the error message … However, if I understood correctly, ending Snakemake with … Well, your logs, which were from a run within a SLURM job, indicate the submission of only one job, not multiple. Do you have a public repo and test data to try?
I think some things might not be clear from my explanation. Just to give you some background: I'm testing migrating my workflow from Snakemake 7 to 8 and updating a lot of dependencies in the process (hence the np.empty error; this is not the issue I'm talking about). As a secondary finding, I reported that when using the plugin, the log always shows that jobs are submitted with "jobID" 128, while the jobs actually have other job IDs that are given by SLURM. I have no idea if this is a problem; I just pointed it out. For reproducibility, I created a very simple workflow:
With a script that fails on purpose:
And I run it with the following command line: … The Python script fails immediately but Snakemake carries on; this time, though, the verbose flag is generating output: … I hope this is clearer and allows the issue (if any) to be tracked down!
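(For illustration only: the actual Snakefile, script, and command line were not captured above, so the following is a hypothetical minimal sketch of such a reproducer; all rule, file, and path names are made up.)

```python
# Snakefile (hypothetical): a single rule running a script that always fails
rule fail_on_purpose:
    output:
        "results/out.txt"
    script:
        "scripts/fail.py"
```

```python
# scripts/fail.py (hypothetical): exits with an error immediately,
# so the rule's output is never produced
import sys

sys.exit("failing on purpose")
```

```console
# hypothetical invocation with the SLURM executor and verbose logging
snakemake --executor slurm --jobs 1 --verbose
```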
When working with the profile, SLURM support was not an integral part of Snakemake; Snakemake had no idea of what was going on. Now, the plugin will report failed jobs as soon as the status is known (to avoid straining the scheduler DB we ask at certain intervals, so there is no immediate feedback - otherwise a number of admins would, with good reason, not allow Snakemake to run on "their" cluster). It continues working, as still-running jobs might create valuable results, and a user can always restart a workflow to complete them. If you, however, are observing the run and want to fix things during development, you need to signal Snakemake (… Regarding the job ID: your master log only shows one submitted job. Does it really look like this when run on the login node?
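(For illustration, a hedged sketch of the two ways the main Snakemake process can be stopped in this situation; the job ID below is a placeholder, and the exact cleanup behaviour depends on the Snakemake and plugin versions.)

```console
# Interactive run on the login node: Ctrl+C signals Snakemake,
# which then cancels its outstanding SLURM jobs before exiting.
snakemake --executor slurm --jobs 10
^C

# If the main Snakemake process itself runs inside a SLURM job,
# cancel that job instead (12345678 is a placeholder job ID).
scancel 12345678
```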
Ok, so according to your answer it is indeed intended behavior then. As I mentioned before, I opened this issue because in Snakemake v7 the default behavior is to terminate Snakemake upon job failure UNLESS you use the … I still feel it does not make sense that if I run the test workflow without --executor (…
but there are NO other downstream/parallel jobs running, yet Snakemake is kept "waiting" for something that will never happen. And again, we have the … Regarding the log in my last answer, that was a run on the login node via shell, and it only shows one job because there is only one job being submitted.
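(For context, a hedged sketch of the stop-versus-continue behaviour being contrasted, using Snakemake's standard command-line flags; whether --keep-going is the flag elided above is an assumption.)

```console
# default: no new jobs are scheduled after a failure
# (already running jobs are still awaited)
snakemake --executor slurm --jobs 10

# keep going with independent jobs even if some fail
snakemake --executor slurm --jobs 10 --keep-going
```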
I checked: … For this, I can implement a new (boolean) flag. It might be handy for development purposes.
Having a boolean flag would be really useful! It will at least allow recovering the previous behavior from Snakemake 7, which in my opinion is indeed much better for the development/debugging stages.
Made a PR, please check it (#139).
I'm trying to update my snakemake workflow to v8+
My nuisance so far is the following. I submit my workflows as a job from the login node, and from there Snakemake and the executor do their thing. The problem is that if the jobs spawned by the executor fail, the job submitted from the login node is not terminated, leaving me with the "nuisance" of having to manually "scancel" my failed Snakemake job. The log does not help either; it just shows the submission of jobs, so I'm guessing Snakemake is not detecting the failed downstream jobs.
Is this an intended behavior? Am I missing an option that needs to be activated with the executor?
Previously, when I submitted a job from the login node via sbatch and used Snakemake v7 with the old SLURM profile, the submitted job would terminate immediately upon a job failing (as intended, since the workflow could not continue because of the downstream job failures).
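(For illustration, a hedged sketch of the kind of submission described here; the job name, resource values, and Snakemake options are placeholders rather than the reporter's actual setup.)

```bash
#!/bin/bash
#SBATCH --job-name=smk-main      # placeholder name for the main Snakemake job
#SBATCH --time=24:00:00          # placeholder wall time
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

# The main Snakemake process runs inside this job and submits the
# individual rule jobs to SLURM via the executor plugin.
snakemake --executor slurm --jobs 50 --verbose
```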