Two things prevent the use of this plugin on our cluster:
- The slurm.conf file used for the cluster is in a non-standard location.
- Our admins prefer that we not run anything on the login node.
In order to make many of the common SLURM tools work, users of our cluster need to have SLURM_CONFIG set in their environment. Since all environment variables prefixed with SLURM_* are wiped if the plugin sees SLURM_JOB_ID, this results in sacctmgr and sinfo exiting with an error:
WorkflowError:
Unable to test the validity of the given or guessed SLURM account 'xyz' with sacctmgr: sacctmgr: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
sacctmgr: error: fetch_config: DNS SRV lookup failed
sacctmgr: error: _establish_config_source: failed to fetch config
sacctmgr: fatal: Could not establish a configuration source
This seems like an unintended consequence and could be easily fixed by not removing SLURM_CONFIG.
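To illustrate, here is a minimal sketch of how such a cleanup could spare the config variable (the function name and the set of kept variables are hypothetical and not the plugin's actual code; SLURM_CONFIG is used here only because that is the variable our site sets):

```python
import os

# Hypothetical helper: drop inherited job-context SLURM_* variables before
# calling sacctmgr/sinfo, but keep the ones needed to locate the cluster config.
KEEP = {"SLURM_CONFIG"}  # adjust to whatever variable your site relies on

def cleaned_slurm_env(environ=os.environ):
    """Return a copy of the environment with job-context SLURM_* variables removed."""
    return {
        key: value
        for key, value in environ.items()
        if not key.startswith("SLURM_") or key in KEEP
    }
```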
The issue can be avoided by running:
unset SLURM_JOB_ID
Personally, I think it would be nice to be able to set the values for slurm_account / slurm_partition via environment variables (as srun / sbatch do); to me, this seems like a sensible way to determine a default value.
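For example, a rough sketch of what such a fallback could look like (assuming the SBATCH_ACCOUNT / SBATCH_PARTITION and SLURM_ACCOUNT / SLURM_PARTITION input variables honoured by sbatch / srun; the function names are made up for illustration):

```python
import os

# Hypothetical defaults: mirror the input environment variables that
# sbatch and srun consult when --account / --partition are not given.
def default_slurm_account():
    return os.environ.get("SBATCH_ACCOUNT") or os.environ.get("SLURM_ACCOUNT")

def default_slurm_partition():
    return os.environ.get("SBATCH_PARTITION") or os.environ.get("SLURM_PARTITION")
```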
Thanks for your work and continuing to help the community!
Thanks for reporting: I will look into it. Thankfully, you provided the solution along with the report. Meanwhile, I think unsetting the SLURM variables will not help at all: according to SchedMD's documentation, the built-in variables are always exported. The solution must be to set the job parameters explicitly, always. Could you test the code over a few iterations? I'm afraid I might only get to it on Thursday or Friday, though.
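Roughly what I have in mind (a sketch only, with made-up helper names, not the final implementation): the submission call would always carry the account and partition flags, so nothing depends on inherited SLURM_* state:

```python
# Sketch: pass account/partition explicitly on every sbatch call,
# so submission does not rely on inherited SLURM_* variables.
def build_sbatch_call(jobscript, account=None, partition=None):
    call = ["sbatch", "--parsable"]
    if account:
        call.append(f"--account={account}")
    if partition:
        call.append(f"--partition={partition}")
    call.append(jobscript)
    return call
```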
NB:
Our admins prefer ...
Yes, I get this nonsense a lot. As if it hurts anyone when someone produces a plot within a few seconds on a login node. (Or runs a workflow manager, which consumes about as much CPU power over the course of a workflow.)