
sacctmgr error when running in job context #164

Open
jprnz opened this issue Nov 4, 2024 · 1 comment

jprnz commented Nov 4, 2024

Two things prevent the use of this plugin on our cluster:

  1. The slurm.conf file used for the cluster is in a non-standard location
  2. Our admins prefer that we not run anything on the login node

To make many of the common SLURM tools work, users of our cluster need to have SLURM_CONF set in their environment. Since all environment variables prefixed with 'SLURM_' are wiped if the plugin sees SLURM_JOB_ID, sacctmgr and sinfo exit with an error:

WorkflowError:
Unable to test the validity of the given or guessed SLURM account 'xyz' with sacctmgr: sacctmgr: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
sacctmgr: error: fetch_config: DNS SRV lookup failed
sacctmgr: error: _establish_config_source: failed to fetch config
sacctmgr: fatal: Could not establish a configuration source

This seems like an unintended consequence and could easily be fixed by not removing SLURM_CONF.
The issue can be worked around by running:

unset SLURM_JOB_ID
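A minimal sketch of what I mean, assuming the plugin clears the SLURM_* variables in one place before shelling out to sacctmgr / sinfo (the helper name and keep list here are illustrative, not the plugin's actual code):

import os
import subprocess

def scrubbed_slurm_env(keep=("SLURM_CONF",)):
    # Drop the job-context SLURM_* variables, but keep SLURM_CONF so the
    # command-line tools can still locate a non-standard slurm.conf.
    env = dict(os.environ)
    for name in list(env):
        if name.startswith("SLURM_") and name not in keep:
            del env[name]
    return env

# e.g. calling sinfo without inheriting the job context:
subprocess.run(["sinfo", "--version"], env=scrubbed_slurm_env(), check=True)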

Personally, I think it would be nice to be able to set the values for slurm_account / slurm_partition via environment variables (as srun / sbatch do); to me this seems like a sensible way to determine a default value.
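As a sketch of that idea, the plugin could fall back to the same input environment variables that sbatch and srun themselves honor; the helper names are hypothetical:

import os

def default_account():
    # sbatch reads SBATCH_ACCOUNT, srun reads SLURM_ACCOUNT
    return os.environ.get("SBATCH_ACCOUNT") or os.environ.get("SLURM_ACCOUNT")

def default_partition():
    # sbatch reads SBATCH_PARTITION, srun reads SLURM_PARTITION
    return os.environ.get("SBATCH_PARTITION") or os.environ.get("SLURM_PARTITION")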

Thanks for your work and continuing to help the community!

$ snakemake --version
8.25.1

$ mamba list | grep "snakemake-executor-plugin-slurm"
snakemake-executor-plugin-slurm 0.11.1             pyhdfd78af_0    bioconda
snakemake-executor-plugin-slurm-jobstep 0.2.1              pyhdfd78af_0    bioconda

$ sinfo --version
slurm 23.02.7
cmeesters (Member) commented

Thanks for reporting; I will look into it. Thankfully, you provided the solution along with the report. That said, I think unsetting the SLURM variables will not help at all: according to SchedMD's documentation, the built-in variables are always exported. The solution must be to set the job parameters explicitly, always. Could you test code over a few iterations? I'm afraid I might only get to it on Thursday or Friday, though.
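As a rough sketch of what "explicitly" means here (the account and partition values are placeholders only): pass them on the sbatch command line rather than relying on inherited SLURM_* variables.

import subprocess

# Submit with account and partition given explicitly, so the submission no
# longer depends on which SLURM_* variables happen to be inherited.
cmd = [
    "sbatch",
    "--account", "xyz",              # placeholder account
    "--partition", "somepartition",  # placeholder partition
    "--wrap", "srun hostname",
]
subprocess.run(cmd, check=True)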

NB:

Our admins prefer ...

Yes, I get this nonsense a lot. As if it hurts anyone when someone produces a plot within a few seconds on a login node. (Or runs a workflow manager, which consumes about as much CPU power over the course of a workflow run.)
