
availableCores(): Add support for HTCondor #50

Open
HenrikBengtsson opened this issue Apr 19, 2021 · 3 comments
Labels: help wanted (Extra attention is needed)

Comments

@HenrikBengtsson (Collaborator) commented Apr 19, 2021

HTCondor users, I need your help to add support for HTCondor to availableCores():

HPC schedulers such as Slurm, SGE, and Torque/PBS set environment variables that can be queried to figure out how many CPU cores the scheduler has allotted to the job. This allows the job script to adapt to the resources it has been given. For example, when submitting an SGE job that requests four (4) cores:

$ qsub -pe smp 4 my_script.sh

the my_script.sh script can find out how many cores it was allotted:

ncores=${NSLOTS:-1}
echo "I am allowed to use $ncores cores on this machine"

Question: How do you achieve the same on HTCondor? Does HTCondor set environment variables in a similar way, or are there other ways to query the number of cores you've been assigned?


FWIW, I tried searching the web for how to do this, but failed to find anything useful. The closest I found is Section 2.5.11 of https://www.mn.uio.no/ifi/tjenester/it/hjelp/beregninger/htcondor/condor-manual.pdf:

HTCondor sets several additional environment variables for each executing job that may be useful for the job to reference.

  • _CONDOR_SCRATCH_DIR gives the directory where the job may place temporary data files. This directory is unique for every job that is run, and its contents are deleted by HTCondor when the job stops running on a machine, no matter how the job completes.

  • _CONDOR_SLOT gives the name of the slot (for SMP machines), on which the job is run. On machines with only a single slot, the value of this variable will be 1, just like the SlotID attribute in the machine's ClassAd. This setting is available in all universes. See section 3.7.1 for more details about SMP machines and their configuration.

  • CONDOR_VM equivalent to _CONDOR_SLOT described above, except that it is only available in the standard universe. NOTE: As of HTCondor version 6.9.3, this environment variable is no longer used. It will only be defined if the ALLOW_VM_CRUFT configuration variable is set to True.

  • X509_USER_PROXY gives the full path to the X.509 user proxy file if one is associated with the job. Typically, a user will specify x509userproxy in the submit description file. This setting is currently available in the local, java, and vanilla universes.

  • _CONDOR_JOB_AD is the path to a file in the job's scratch directory which contains the job ad for the currently running job. The job ad is current as of the start of the job, but is not updated during the running of the job. The job may read attributes and their values out of this file as it runs, but any changes will not be acted on in any way by HTCondor. The format is the same as the output of the condor_q -l command. This environment variable may be particularly useful in a USER_JOB_WRAPPER.

  • _CONDOR_MACHINE_AD is the path to a file in the job's scratch directory which contains the machine ad for the slot the currently running job is using. The machine ad is current as of the start of the job, but is not updated during the running of the job. The format is the same as the output of the condor_status -l command.

  • _CONDOR_JOB_IWD is the path to the initial working directory the job was born with.

  • _CONDOR_WRAPPER_ERROR_FILE is only set when the administrator has installed a USER_JOB_WRAPPER. If this file exists, HTCondor assumes that the job wrapper has failed and copies the contents of the file to the StarterLog for the administrator to debug the problem.

  • CONDOR_IDS overrides the value of configuration variable CONDOR_IDS, when set in the environment.

  • CONDOR_ID is set for scheduler universe jobs to be the same as the ClusterId attribute.
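
If _CONDOR_MACHINE_AD really mirrors the output of condor_status -l, then one option might be to parse the CPU allotment out of that file. Below is an untested R sketch; note that the Cpus attribute name is my assumption and is not confirmed by the excerpt above:

# Untested sketch: parse the provisioned CPU count from the machine
# ClassAd file exposed via _CONDOR_MACHINE_AD. Assumes the ad holds a
# line such as 'Cpus = 4' (the attribute name is an assumption)
machine_ad <- Sys.getenv("_CONDOR_MACHINE_AD")
ncores <- NA_integer_  # unknown when not running under HTCondor
if (nzchar(machine_ad) && file.exists(machine_ad)) {
  lines <- readLines(machine_ad, warn = FALSE)
  hit <- grep("^Cpus[[:space:]]*=", lines, value = TRUE)
  if (length(hit) > 0L) {
    ncores <- as.integer(sub("^Cpus[[:space:]]*=[[:space:]]*([[:digit:]]+).*", "\\1", hit[[1L]]))
  }
}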

@HenrikBengtsson added the help wanted label Apr 19, 2021
@HenrikBengtsson (Collaborator, Author) commented

@fboehm, I see you're suggesting parallelly::availableCores() and you've got a vignette on how to use your qtl2pleio package with HTCondor 👍 Do you happen to know the answers to the above HTCondor-specific questions? I don't have access to HTCondor, so I need help to add support for HTCondor to availableCores().

@achubaty commented Apr 2, 2024

I don't have an HTCondor setup handy to test, but the docs say:

CUBACORES GOMAXPROCS JULIA_NUM_THREADS MKL_NUM_THREADS NUMEXPR_NUM_THREADS OMP_NUM_THREADS OMP_THREAD_LIMIT OPENBLAS_NUM_THREADS ROOT_MAX_THREADS TF_LOOP_PARALLEL_ITERATIONS TF_NUM_THREADS are set to the number of cpu cores provisioned to this job. Should be at least RequestCpus, but HTCondor may match a job to a bigger slot. Jobs should not spawn more than this number of cpu-bound threads, or their performance will suffer. Many third party libraries like OpenMP obey these environment variables.
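
If that holds, an untested R sketch of how this could be queried (using OMP_NUM_THREADS as a representative of the variables listed above):

# Untested sketch: read the CPU count HTCondor reportedly advertises via
# OMP_NUM_THREADS (one of the variables quoted above); NA when unset
value <- Sys.getenv("OMP_NUM_THREADS", NA_character_)
ncores <- if (is.na(value)) NA_integer_ else as.integer(value)

One caveat: OMP_NUM_THREADS can be set for reasons unrelated to HTCondor, so on its own it is not an HTCondor-specific signal.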

@fboehm commented Apr 6, 2024

@HenrikBengtsson - I'm so sorry that I missed this message (from 3 years ago!) until now. @lmichael107 @CHTC has a lot of HTCondor experience, and she may be able to connect us with others at U. Wisconsin-Madison who might also have answers to some of the above HTCondor questions. I regret that I'm clueless here. My past uses of HTCondor were pretty crude, in the sense that I don't think I ever understood the HTCondor variables or how to integrate them with R package functions, especially the availableCores() function.
