behavior of `availableWorkers()` when Slurm `--nodes`, `--ntasks` and `--cpus-per-task` are provided #85
Sorry for the delay. I think we need to identify what combinations of these options we need to support. For a starter, I've created a script (sbatch-params-all.R.txt) that records, for each combination, the relevant Slurm environment variables together with what `availableCores()` and `availableWorkers()` report:

```
# A tibble: 25 × 13
ntasks nodes cpus_per_task HOSTNAME NTASKS JOB_NUM_NODES JOB_NODELIST JOB_CPUS_PER_NODE TASKS_PER_NODE CPUS_PER_TASK CPUS_ON_NODE availableCores availableWorkers
<int> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 NA NA NA c4-n12 NA 1 c4-n12 2 2 NA 2 2 2*c4-n12
2 1 NA NA c4-n12 1 1 c4-n12 2 1 NA 2 2 2*c4-n12
3 1 1 NA c4-n12 1 1 c4-n12 2 1 NA 2 2 2*c4-n12
4 1 NA 1 c4-n12 1 1 c4-n12 2 1 1 2 1 1*c4-n12
5 1 1 1 c4-n12 1 1 c4-n12 2 1 1 2 1 1*c4-n12
6 2 NA NA c4-n12 2 1 c4-n12 2 2 NA 2 2 2*c4-n12
7 2 NA 1 c4-n12 2 1 c4-n12 2 2 1 2 1 1*c4-n12
8 2 NA 2 c4-n12 2 1 c4-n12 4 2 2 4 2 2*c4-n12
9 2 1 NA c4-n12 2 1 c4-n12 2 2 NA 2 2 2*c4-n12
10 2 1 1 c4-n12 2 1 c4-n12 2 2 1 2 1 1*c4-n12
11 2 1 2 c4-n12 2 1 c4-n12 4 2 2 4 2 2*c4-n12
12 2 2 NA c4-n12 2 2 c4-n[12-13] 2(x2) 1(x2) NA 2 1 2*c4-n12, 2*c4-n13
13 2 2 1 c4-n12 2 2 c4-n[12-13] 2(x2) 1(x2) 1 2 1 1*c4-n12, 1*c4-n13
14 2 2 2 c4-n12 2 2 c4-n[12-13] 2(x2) 1(x2) 2 2 2 2*c4-n12, 2*c4-n13
15 4 NA 2 c4-n12 4 1 c4-n12 8 4 2 8 2 2*c4-n12
16 4 2 2 c4-n12 4 2 c4-n[12-13] 6,2 3,1 2 6 2 2*c4-n12, 2*c4-n13
17 16 NA NA c4-n12 16 1 c4-n12 16 16 NA 16 16 16*c4-n12
18 16 NA 4 c4-n12 16 2 c4-n[12-13] 40,24 10,6 4 40 4 4*c4-n12, 4*c4-n13
19 16 1 NA c4-n13 16 1 c4-n13 16 16 NA 16 16 16*c4-n13
20 16 4 NA c4-n1 16 4 c4-n[1-4] 10,2(x3) 10,2(x3) NA 10 10 10*c4-n1, 2*c4-n2, 2*c4-n3, 2*c4-n4
21 16 4 1 c4-n1 16 4 c4-n[1-4] 10,2(x3) 10,2(x3) 1 10 1 1*c4-n1, 1*c4-n2, 1*c4-n3, 1*c4-n4
22 NA 1-2 8 c4-n12 NA 2 c4-n[12-13] 8(x2) 1(x2) 8 8 8 8*c4-n12, 8*c4-n13
23 NA 2 8 c4-n12 NA 2 c4-n[12-13] 8(x2) 1(x2) 8 8 8 8*c4-n12, 8*c4-n13
24 NA 2 8 c4-n12 NA 2 c4-n[12-13] 8(x2) 1(x2) 8 8 8 8*c4-n12, 8*c4-n13
25 NA 4 8 c4-n1 NA 4 c4-n[1-4] 8(x4) 1(x4) 8 8 8 8*c4-n1, 8*c4-n2, 8*c4-n3, 8*c4-n4
```

PS. I've dropped the
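The column names in the table correspond to Slurm environment variables with the `SLURM_` prefix removed (e.g., `NTASKS` is `SLURM_NTASKS`). A minimal sketch for inspecting these variables from within a Slurm job:

```r
## Minimal sketch: the Slurm environment variables behind the table's
## columns; all of them are set by Slurm inside a job allocation.
slurm_vars <- c("SLURM_NTASKS", "SLURM_JOB_NUM_NODES", "SLURM_JOB_NODELIST",
                "SLURM_JOB_CPUS_PER_NODE", "SLURM_TASKS_PER_NODE",
                "SLURM_CPUS_PER_TASK", "SLURM_CPUS_ON_NODE")
print(Sys.getenv(slurm_vars, unset = NA))
print(parallelly::availableCores())
print(parallelly::availableWorkers())
```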
Here's a cleaned-up version (sbatch-params-all.R.txt). I'm now sorting by `cpus_per_task`:

```
# A tibble: 25 × 10
ntasks nodes cpus_per_task JOB_NUM_NODES JOB_NODELIST JOB_CPUS_PER_NODE CPUS_ON_NODE TASKS_PER_NODE availableCores availableWorkers
<int> <chr> <int> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
1 NA NA NA 1 c4-n12 2 2 2 2 2*c4-n12
2 1 NA NA 1 c4-n12 2 2 1 2 2*c4-n12
3 1 1 NA 1 c4-n12 2 2 1 2 2*c4-n12
4 16 NA NA 1 c4-n12 16 16 16 16 16*c4-n12
5 16 1 NA 1 c4-n13 16 16 16 16 16*c4-n13
6 16 4 NA 4 c4-n[3-5,37] 10,2(x3) 10 10,2(x3) 10 10*c4-n3, 2*c4-n37, 2*c4-n4, 2*c4-n5
7 2 NA NA 1 c4-n12 2 2 2 2 2*c4-n12
8 2 1 NA 1 c4-n12 2 2 2 2 2*c4-n12
9 2 2 NA 2 c4-n[12-13] 2(x2) 2 1(x2) 1 2*c4-n12, 2*c4-n13
10 1 NA 1 1 c4-n12 2 2 1 1 1*c4-n12
11 1 1 1 1 c4-n12 2 2 1 1 1*c4-n12
12 16 4 1 4 c4-n[3-5,37] 10,2(x3) 10 10,2(x3) 1 1*c4-n3, 1*c4-n37, 1*c4-n4, 1*c4-n5
13 2 NA 1 1 c4-n12 2 2 2 1 1*c4-n12
14 2 1 1 1 c4-n12 2 2 2 1 1*c4-n12
15 2 2 1 2 c4-n[12-13] 2(x2) 2 1(x2) 1 1*c4-n12, 1*c4-n13
16 2 NA 2 1 c4-n12 4 4 2 2 2*c4-n12
17 2 1 2 1 c4-n12 4 4 2 2 2*c4-n12
18 2 2 2 2 c4-n[12-13] 2(x2) 2 1(x2) 2 2*c4-n12, 2*c4-n13
19 4 NA 2 1 c4-n12 8 8 4 2 2*c4-n12
20 4 2 2 2 c4-n[12-13] 6,2 6 3,1 2 2*c4-n12, 2*c4-n13
21 16 NA 4 2 c4-n[12-13] 40,24 40 10,6 4 4*c4-n12, 4*c4-n13
22 NA 1-2 8 2 c4-n[12-13] 8(x2) 8 1(x2) 8 8*c4-n12, 8*c4-n13
23 NA 2 8 2 c4-n[12-13] 8(x2) 8 1(x2) 8 8*c4-n12, 8*c4-n13
24 NA 2 8 2 c4-n[12-13] 8(x2) 8 1(x2) 8 8*c4-n12, 8*c4-n13
25 NA 4 8 4 c4-n[3-4,38-39] 8(x4) 8 1(x4) 8 8*c4-n3, 8*c4-n38, 8*c4-n39, 8*c4-n4
```
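A note on the compact notation in `JOB_CPUS_PER_NODE` and `TASKS_PER_NODE`: Slurm compresses repeated per-node values, so `10,2(x3)` means one node with 10 followed by three nodes with 2 each. A hypothetical helper to expand this notation (`expand_slurm_counts` is an illustrative name, not parallelly's actual implementation):

```r
## Expand Slurm's compact per-node counts, e.g. "10,2(x3)" -> c(10, 2, 2, 2)
expand_slurm_counts <- function(spec) {
  parts <- strsplit(spec, ",", fixed = TRUE)[[1]]
  counts <- lapply(parts, function(part) {
    ## Match "<count>" optionally followed by "(x<repeats>)"
    m <- regmatches(part, regexec("^([0-9]+)(\\(x([0-9]+)\\))?$", part))[[1]]
    rep(as.integer(m[2]), if (nzchar(m[4])) as.integer(m[4]) else 1L)
  })
  unlist(counts)
}
expand_slurm_counts("10,2(x3)")  # c(10, 2, 2, 2)
```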
This is a fun little puzzle! Here's one possible approach that tries to have

(*) This is not good in the case of a multi-node allocation and a user using the multisession or multicore plan, but this would mostly be user error in understanding how distributed computing works. Perhaps issue a warning or do something else in that case. If one instead had

(**) A user might think they are parallelizing across all available cores in this case, but they wouldn't be. One could issue a warning, but it would trigger falsely in the case of using threading (e.g., BLAS) nested within future's workers.
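The warning suggested in (*) could be sketched as follows (hypothetical helper name and message wording; `SLURM_JOB_NUM_NODES` is a standard Slurm output variable):

```r
## Hypothetical check: warn when the Slurm allocation spans multiple
## nodes but the chosen future plan can only use the current node.
warn_if_multinode <- function() {
  num_nodes <- as.integer(Sys.getenv("SLURM_JOB_NUM_NODES", "1"))
  if (!is.na(num_nodes) && num_nodes > 1L) {
    warning("Slurm allocation spans ", num_nodes, " nodes, but ",
            "multisession/multicore futures run only on the current node; ",
            "consider plan(cluster, workers = availableWorkers())")
  }
}
```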
(Disclaimer: I'm not thinking about multithreading and threads per CPU core at all here.)

Hi. Thanks. Some quick comments for clarification/additional constraints:
The design and purpose of

So,

Regarding

If I understand you correctly, yes, I think

```r
workers <- availableWorkers()
nworkers <- length(workers)
plan(cluster, workers = availableWorkers())
```

which sets up a PSOCK cluster like

```r
workers <- availableWorkers()
cl <- parallelly::makeClusterPSOCK(workers)
```

will result in

I'm not sure I follow here. See the above example saying it should indeed set up
Regarding

Regarding
Thanks for this. Let me digest this (= find some deep focus time to think more about it).

Correct. So, before making any changes to
Just an updated run with more combinations:
Note to self: List also what CGroups and
Now with

This was generated using
I noticed that in this case, available workers is based on `-c` (`--cpus-per-task`) and not `-n` (`--ntasks`). I think in this case, it would be more natural to have it report 4 workers, with the user probably wanting to use threading (e.g., linear algebra or Rcpp+OpenMP) within each of the 4 workers. That said, there may be cases where returning 2 available workers makes most sense.
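The nested pattern described here, one worker per task with threading inside each worker, could be sketched like this (assumes it runs inside a Slurm job; the use of `RhpcBLASctl` to cap BLAS threads per worker is an assumption for illustration, not something parallelly does itself):

```r
library(future)

## One R worker per Slurm task; SLURM_CPUS_PER_TASK threads are then
## available inside each worker for multi-threaded code (BLAS, OpenMP).
workers <- parallelly::availableWorkers()
plan(cluster, workers = workers)

threads_per_worker <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", "1"))

f <- future({
  RhpcBLASctl::blas_set_num_threads(threads_per_worker)  # cap BLAS threads
  crossprod(matrix(rnorm(1e6), nrow = 1000L))            # multi-threaded BLAS
})
v <- value(f)
```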
However, the result above seems inconsistent with this next result (after all, why should it matter how many nodes the 4 tasks are running on?):
That said, I can imagine that handling all the various ways a user might combine `--nodes` (`-N`), `--ntasks` (`-n`), and `--cpus-per-task` (`-c`) could be tricky...

EDIT 2022-12-13: Added long-option formats for clarification. /Henrik