HELP WANTED: availableWorkers() #16

Status: Open (7 of 11 tasks)
Opened by HenrikBengtsson on Dec 29, 2016 · 1 comment
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

HenrikBengtsson (Collaborator) commented on Dec 29, 2016:

Background

When submitting a job to TORQUE / PBS using something like:

qsub -l nodes=3:ppn=2 myjob.sh

the scheduler will allocate 3 nodes with 2 cores each (= 6 cores in total) to myjob.sh when it is launched. Exactly which 3 nodes were allocated is only known to myjob.sh at run time. This information is available in a file, whose path is given by the environment variable PBS_NODEFILE, written by TORQUE / PBS, e.g.

$ cat $PBS_NODEFILE
n1
n1
n8
n8
n9
n9
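
As a hedged illustration (assuming the job above is running and PBS_NODEFILE is set), the node file can be read directly from R:

> nodes <- readLines(Sys.getenv("PBS_NODEFILE"), warn = FALSE)
> nodes
[1] "n1" "n1" "n8" "n8" "n9" "n9"
> table(nodes)   ## cores ("ppn") allocated per node
nodes
n1 n8 n9
 2  2  2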

Other HPC job schedulers use other files / environment variables for this.

Actions

Add an availableNodes() function that searches for common environment variables and returns a vector of node names, e.g.

> availableNodes()
[1] "n1" "n1" "n8" "n8" "n9" "n9"

If no known environment variables are found, the default fallback could be to return rep("localhost", times = availableCores()).

The above would allow us to make workers = availableNodes() the new default for cluster futures (currently workers = availableCores()).
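
A rough sketch of what such an availableNodes() could look like, covering only PBS plus the proposed localhost fallback (availableCores() as already provided; everything else is illustrative, not a final implementation):

availableNodes <- function() {
  pathname <- Sys.getenv("PBS_NODEFILE", NA_character_)
  if (!is.na(pathname) && file.exists(pathname)) {
    ## One node name per line, repeated "ppn" times per node
    return(readLines(pathname, warn = FALSE))
  }
  ## Fallback when no known scheduler variables are found
  rep("localhost", times = availableCores())
}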

Identify these settings for the following schedulers:

  • PBS (Portable Batch System): Environment variable PBS_NODEFILE (the name of a file containing one node per line where each node is repeated "ppn" times).
  • Oracle Grid Engine (aka Sun Grid Engine, CODINE, GRD). Environment variable PE_HOSTFILE (a file, format unclear), cf. https://www.ace-net.ca/wiki/Sun_Grid_Engine
  • Slurm (Simple Linux Utility for Resource Management). Environment variable SLURM_JOB_NODELIST (a list of nodes in a compressed format, e.g. instead of "tux1,tux3,tux4" it is stored as "tux[1,3-4]". Note that multiple "compressions" may exist, e.g. "compute-[0-6]-[0-15]". The number of nodes can be verified against SLURM_JOB_NUM_NODES. The "ppn" information is stored in SLURM_TASKS_PER_NODE). See the sketch after this list for one way to expand the compressed node list.
  • LSF / OpenLava (Platform Load Sharing Facility).
    • LSB_HOSTS (also covered by the sketch after this list)
  • Spark
  • OAR
  • HTCondor
  • Moab
  • PJM (https://staff.cs.manchester.ac.uk/~fumie/internal/Job_Operation_Software_en.pdf)
    • PJM_O_NODEINF - "Path of the allocated node list file. For a job to which virtual nodes are allocated, the IP addresses of the nodes where the virtual nodes are placed are written one per line."
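
For Slurm and LSF / OpenLava, a rough sketch of how the above variables could be expanded into a plain vector of node names. This is illustrative only: it assumes the Slurm scontrol command is on the PATH, and that LSB_HOSTS is a whitespace-separated list of host names with one entry per allocated slot; the helper names are made up.

slurm_nodes <- function() {
  nodelist <- Sys.getenv("SLURM_JOB_NODELIST", NA_character_)
  if (is.na(nodelist)) return(character(0L))
  ## 'scontrol show hostnames' expands compressed node lists such as "tux[1,3-4]";
  ## repeating each node per SLURM_TASKS_PER_NODE is left out of this sketch.
  system2("scontrol", args = c("show", "hostnames", shQuote(nodelist)), stdout = TRUE)
}

lsf_nodes <- function() {
  hosts <- Sys.getenv("LSB_HOSTS", "")
  if (!nzchar(hosts)) return(character(0L))
  ## Assumed format: host names separated by whitespace, one per slot
  strsplit(hosts, split = "[[:space:]]+")[[1]]
}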
HenrikBengtsson (Collaborator, Author) commented:

Add validation of the PBS_NODEFILE contents against the counts PBS_NP and / or PBS_NUM_NODES * PBS_NUM_PPN.
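
A hedged sketch of such a check, assuming the PBS variables named above; the helper name and warning messages are illustrative:

validate_pbs_nodefile <- function(nodes) {
  n <- length(nodes)
  np <- as.integer(Sys.getenv("PBS_NP", NA_character_))
  nn <- as.integer(Sys.getenv("PBS_NUM_NODES", NA_character_))
  ppn <- as.integer(Sys.getenv("PBS_NUM_PPN", NA_character_))
  if (!is.na(np) && np != n)
    warning("Number of entries in PBS_NODEFILE does not match PBS_NP")
  if (!is.na(nn) && !is.na(ppn) && nn * ppn != n)
    warning("Number of entries in PBS_NODEFILE does not match PBS_NUM_NODES * PBS_NUM_PPN")
  nodes
}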

HenrikBengtsson referenced this issue in futureverse/future on Jan 7, 2017: "…nd PBS_NUM_NODES * PBS_NUM_PPN. If inconsistent, a warning is generated." [#118]
HenrikBengtsson changed the title from "availableWorkers()" to "HELP WANTED: availableWorkers()" on Jun 5, 2020
HenrikBengtsson transferred this issue from futureverse/future on Oct 20, 2020