# UCC User Guide

This guide describes how to leverage UCC to accelerate collectives within a parallel
programming model, e.g. MPI or OpenSHMEM, contingent on support from the particular
implementation of the programming model. For simplicity, this guide uses Open MPI as
one such example implementation. However, the described concepts are sufficiently
general, so they should transfer to other MPI implementations or programming models
as well.

Note that this is not a guide on how to use the UCC API or contribute to UCC; for
that, see the UCC API documentation available [here](https://openucx.github.io/ucc/)
and consider the technical and legal guidelines in the [contributing](../CONTRIBUTING.md)
file.

## Getting Started

Build Open MPI with UCC as described [here](../README.md#open-mpi-and-ucc-collectives).
To verify that your Open MPI build supports UCC accelerated collectives, check for
the MCA coll `ucc` component:

```
$ ompi_info | grep ucc
MCA coll: ucc (MCA v2.1.0, API v2.0.0, Component v4.1.4)
```

To execute your MPI program with UCC accelerated collectives, the `ucc` MCA component
needs to be enabled:

```
export OMPI_MCA_coll_ucc_enable=1
```

Currently, it is also required to set

```
export OMPI_MCA_coll_ucc_priority=100
```

to work around https://github.com/open-mpi/ompi/issues/9885.
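
For convenience, both settings can be exported before launching the application. A minimal
sketch, assuming a hypothetical application `./my_mpi_app` launched with `mpirun`:

```
# sketch: enable the ucc coll component and raise its priority, then launch
export OMPI_MCA_coll_ucc_enable=1
export OMPI_MCA_coll_ucc_priority=100
mpirun -np 4 ./my_mpi_app
```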

In most situations, this is all that is needed to leverage UCC accelerated collectives
from your MPI program. UCC heuristics aim to always select the highest performing
implementation for a given collective, and UCC aims to support execution at all scales,
from a single node to a full supercomputer.

However, because there are many different system setups, collectives, and message sizes,
these heuristics can't be perfect in all cases. The remainder of this User Guide therefore
describes the parts of UCC which are necessary for basic UCC tuning. If manual tuning is
necessary, an issue report is appreciated at
[the Github tracker](https://github.com/openucx/ucc/issues) so that this can be considered
for future tuning of UCC heuristics.

Also, an MPI or other programming model implementation might need to execute a collective
not supported by UCC, e.g. because datatype or reduction operator support is lacking. In
these cases the implementation can't call into UCC but needs to leverage another collective
implementation. See [Logging](#logging) for an example of how to detect this in the case of
Open MPI. If UCC support is missing, an issue report describing the use case is appreciated
at [the Github tracker](https://github.com/openucx/ucc/issues) so that this can be considered
for future UCC development.

## CLs and TLs

UCC collective implementations are compositions of one or more **T**eam **L**ayers (TLs).
TLs are designed as thin composable abstraction layers with no dependencies between
different TLs. To fulfill semantic requirements of programming models like MPI and because
not all TLs cover the full functionality required by a given collective (e.g. the SHARP TL
does not support intra-node collectives), TLs are composed by
**C**ollective **L**ayers (CLs). The list of CLs and TLs supported by the available UCC
installation can be queried with:

```
$ ucc_info -s
Default CLs scores: basic=10 hier=50
Default TLs scores: cuda=40 nccl=20 self=50 ucp=10
```

This UCC installation supports two CLs:
- `basic`: Basic CL available for all supported algorithms and good for most use cases.
- `hier`: Hierarchical CL exploiting the hierarchy on a system, e.g. NVLINK within a node
and SHARP for the network. The `hier` CL exposes two hierarchy levels: `NODE` containing
all ranks running on the same node and `NET` containing one rank from each node. In addition
to that, there is the `FULL` subgroup with all ranks. A concrete example of a hierarchical
CL is a pipeline of shared memory UCP reduce with inter-node SHARP and UCP broadcast.
The `basic` CL can leverage the same TLs but would execute in a non-pipelined,
less efficient fashion.

and four TLs:
- `cuda`: TL supporting CUDA device memory exploiting NVLINK connections between GPUs.
- `nccl`: TL leveraging [NCCL](https://github.com/NVIDIA/nccl) for collectives on CUDA
device memory. In many cases, UCC collectives are directly mapped to NCCL collectives.
If that is not possible, a combination of NCCL collectives might be used.
- `self`: TL to support collectives with only 1 participant.
- `ucp`: TL building on UCP point-to-point communication routines from
[UCX](https://github.com/openucx/ucx). This is the most general TL which supports all
memory types. If required, computation happens local to the memory, e.g. for CUDA device
memory CUDA kernels are used for computation.

In addition to those TLs supported by the example Open MPI implementation used in this guide,
UCC also supports the following TLs:
- `sharp`: TL leveraging the
[NVIDIA **S**calable **H**ierarchical **A**ggregation and **R**eduction **P**rotocol (SHARP)™](https://docs.nvidia.com/networking/category/mlnxsharp)
in-network computing features to accelerate inter-node collectives.
- `rccl`: TL leveraging [RCCL](https://github.com/ROCmSoftwarePlatform/rccl) for collectives
on ROCm device memory.

UCC is extensible, so vendors can provide additional TLs. For example, the UCC binaries shipped
with [HPC-X](https://developer.nvidia.com/networking/hpc-x) add the `shm` TL with optimized
CPU shared memory collectives.

UCC exposes environment variables to tune CL and TL selection and behavior. The list of all
environment variables with a description is available from `ucc_info`:

```
$ ucc_info -caf | head -15
# UCX library configuration file
# Uncomment to modify values

#
# UCC configuration
#

#
# Comma separated list of CL components to be used
#
# syntax: comma-separated list of: [basic|hier|all]
#
UCC_CLS=basic
```
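
For example, the CLs and TLs that UCC may use can be restricted through these list variables.
A minimal sketch, assuming the `UCC_TLS` counterpart of `UCC_CLS` listed by `ucc_info -caf`
and a hypothetical application `./my_mpi_app`:

```
# sketch: restrict UCC to the basic CL and the ucp/self TLs for this run
export OMPI_MCA_coll_ucc_enable=1
export OMPI_MCA_coll_ucc_priority=100
export UCC_CLS=basic
export UCC_TLS=ucp,self
mpirun -np 8 ./my_mpi_app
```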

In this guide we will focus on how TLs are selected based on a score. Every time UCC needs
to select a TL, the TL with the highest score is selected considering:

- The collective type
- The message size
- The memory type
- The team size (number of ranks participating in the collective)

A user can set the `UCC_TL_<NAME>_TUNE` environment variables to override the default scores
following this syntax:

```
UCC_TL_<NAME>_TUNE=token1#token2#...#tokenN
```

That is, a `#`-separated list of tokens is passed to the environment variable. Each token is
a `:`-separated list of qualifiers:

```
token=coll_type:msg_range:mem_type:team_size:score:alg
```

Each qualifier is optional; the only requirement is that either `score` or `alg`
is provided. The qualifiers are:

- `coll_type = coll_type_1,coll_type_2,...,coll_type_n` - a `,` separated list of
collective types.
- `msg_range = m_start_1-m_end_1,m_start_2-m_end_2,..,m_start_n-m_end_n` - a `,`
separated list of message ranges in bytes, where each range is represented by `start`
and `end` values separated by `-`. Values can be integers using optional binary
prefixes. Supported prefixes are `K=1<<10`, `M=1<<20`, `G=1<<30`, and `T=1<<40`.
Parsing is case independent and a `b` can optionally be added. The special value
`inf` means MAX msg size. E.g. `128`, `256b`, `4K`, `1M` are valid sizes.
- `mem_type = m1,m2,..,mN` - a `,` separated list of memory types.
- `team_size = [t_start_1-t_end_1,t_start_2-t_end_2,...,t_start_N-t_end_N]` - a
`,` separated list of team size ranges enclosed with `[]`.
- `score` - an `int` value from `0` to `inf`.
- `alg = @<value|str>` - the character `@` followed by either the `int` id or the string
name of the collective algorithm.

Supported memory types are:
- `cpu`: for CPU memory.
- `cuda`: for pinned CUDA Device memory (`cudaMalloc`).
- `cuda_managed`: for CUDA Managed Memory (`cudaMallocManaged`).
- `rocm`: for pinned ROCm Device memory.
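
As an illustrative sketch of this syntax (the scores and ranges below are made-up values for
demonstration, not recommendations):

```
# sketch: raise the TL_UCP score for small host-memory allreduces and
# disable TL_NCCL for alltoall (score 0); values are illustrative only
export UCC_TL_UCP_TUNE=allreduce:0-4K:cpu:100
export UCC_TL_NCCL_TUNE=alltoall:0
```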

The supported collective types and algorithms can be queried with

```
$ ucc_info -A
cl/hier algorithms:
  Allreduce
    0 : rab : intra-node reduce, followed by inter-node allreduce, followed by innode broadcast
    1 : split_rail : intra-node reduce_scatter, followed by PPN concurrent inter-node allreduces, followed by intra-node allgather
  Alltoall
    0 : node_split : splitting alltoall into two concurrent a2av calls withing the node and outside of it
  Alltoallv
    0 : node_split : splitting alltoallv into two concurrent a2av calls withing the node and outside of it
[...] snip
```
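
As a sketch of selecting a specific algorithm with `@`, the `rab` hierarchical allreduce listed
above could be requested as follows; note that the `UCC_CL_HIER_TUNE` variable name is assumed
here by analogy to `UCC_TL_<NAME>_TUNE` and should be verified against `ucc_info -caf` for your
installation:

```
# sketch: enable the hier CL and request its "rab" allreduce algorithm
# (UCC_CL_HIER_TUNE is assumed by analogy; verify with `ucc_info -caf`)
export UCC_CLS=basic,hier
export UCC_CL_HIER_TUNE=allreduce:@rab
```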

See the [FAQ](https://github.com/openucx/ucc/wiki/FAQ#6-what-is-tl-scoring-and-how-to-select-a-certain-tl)
in the [UCC Wiki](https://github.com/openucx/ucc/wiki) for more information and concrete examples.
If multiple TLs have the same highest score for a given combination, it is implementation-defined
which of those TLs is selected.

Tuning UCC heuristics is also possible with the UCC configuration file (`ucc.conf`). This file
provides a unified way of tailoring the behavior of UCC components - CLs, TLs, and ECs. It can
contain any UCC variables in the `VAR = VALUE` format, e.g.
`UCC_TL_NCCL_TUNE=allreduce:cuda:inf#alltoall:0` to force NCCL allreduce for "cuda" buffers and
disable NCCL for alltoall. See [`contrib/ucc.conf`](../contrib/ucc.conf) for an example and the
[FAQ](https://github.com/openucx/ucc/wiki/FAQ#13-ucc-configuration-file-and-priority) for further details.
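
A minimal `ucc.conf` sketch along these lines might look as follows; the entries mirror
variables discussed in this guide and the values are illustrative only:

```
# illustrative ucc.conf sketch: one VAR = VALUE entry per line
UCC_CLS=basic
UCC_TL_NCCL_TUNE=allreduce:cuda:inf#alltoall:0
UCC_LOG_LEVEL=info
```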

## Logging

To detect if Open MPI leverages UCC for a given collective, one can set
`OMPI_MCA_coll_ucc_verbose=3` and check for output like

```
coll_ucc_alltoall.c:70 - mca_coll_ucc_alltoall() running ucc alltoall
```

For example, Open MPI leverages UCC for the `MPI_Alltoall` calls issued by `osu_alltoall` from the
[OSU Microbenchmarks](https://mvapich.cse.ohio-state.edu/benchmarks/), as can be seen from the log message

```
coll_ucc_alltoall.c:70 - mca_coll_ucc_alltoall() running ucc alltoall
```

in the output of

```
$ OMPI_MCA_coll_ucc_verbose=3 srun ./c/mpi/collective/osu_alltoall -i 1 -x 0 -d cuda -m 1048576:1048576

[...] snip

# OSU MPI-CUDA All-to-All Personalized Exchange Latency Test v7.0
# Size Avg Latency(us)
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_alltoall.c:70 - mca_coll_ucc_alltoall() running ucc alltoall
coll_ucc_alltoall.c:70 - mca_coll_ucc_alltoall() running ucc alltoall
coll_ucc_alltoall.c:70 - mca_coll_ucc_alltoall() running ucc alltoall
coll_ucc_alltoall.c:70 - mca_coll_ucc_alltoall() running ucc alltoall
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
1048576 2586333.78

[...] snip

```

For `MPI_Alltoallw`, Open MPI can't leverage UCC, so the output of `osu_alltoallw` looks different:
it only contains the log messages for the barriers needed for correct timing and the reduces needed
to calculate timing statistics:

```
$ OMPI_MCA_coll_ucc_verbose=3 srun ./c/mpi/collective/osu_alltoallw -i 1 -x 0 -d cuda -m 1048576:1048576

[...] snip

# OSU MPI-CUDA All-to-Allw Personalized Exchange Latency Test v7.0
# Size Avg Latency(us)
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
1048576 11434.97

[...] snip

```

To debug the choices made by UCC heuristics, setting `UCC_LOG_LEVEL=INFO` provides valuable
information. E.g., it prints the score map with all supported collectives, TLs, and memory types:

```
[...] snip
ucc_team.c:452 UCC INFO ===== COLL_SCORE_MAP (team_id 32768) =====
ucc_coll_score_map.c:185 UCC INFO Allgather:
ucc_coll_score_map.c:185 UCC INFO Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:185 UCC INFO Cuda: {0..inf}:TL_NCCL:10
ucc_coll_score_map.c:185 UCC INFO CudaManaged: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:185 UCC INFO Allgatherv:
ucc_coll_score_map.c:185 UCC INFO Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:185 UCC INFO Cuda: {0..16383}:TL_NCCL:10 {16K..1048575}:TL_NCCL:10 {1M..inf}:TL_NCCL:10
ucc_coll_score_map.c:185 UCC INFO CudaManaged: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:185 UCC INFO Allreduce:
ucc_coll_score_map.c:185 UCC INFO Host: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
ucc_coll_score_map.c:185 UCC INFO Cuda: {0..4095}:TL_NCCL:10 {4K..inf}:TL_NCCL:10
ucc_coll_score_map.c:185 UCC INFO CudaManaged: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
ucc_coll_score_map.c:185 UCC INFO Rocm: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
ucc_coll_score_map.c:185 UCC INFO RocmManaged: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
[...] snip
```
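
A quick way to capture just the score map for a run might look like the following sketch,
reusing the benchmark invocation from above and filtering on the `COLL_SCORE_MAP` marker:

```
# sketch: print UCC info-level logs and keep only the score map section
UCC_LOG_LEVEL=INFO srun ./c/mpi/collective/osu_allreduce -i 1 -x 0 -d cuda -m 1048576:1048576 2>&1 \
  | grep -A 20 COLL_SCORE_MAP
```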

UCC 1.2.0 or newer supports the `UCC_COLL_TRACE` environment variable:

```
$ ucc_info -caf | grep -B6 UCC_COLL_TRACE
#
# UCC collective logging level. Higher level will result in more verbose collective info.
# Possible values are: fatal, error, warn, info, debug, trace, data, func, poll.
#
# syntax: [FATAL|ERROR|WARN|DIAG|INFO|DEBUG|TRACE|REQ|DATA|ASYNC|FUNC|POLL]
#
UCC_COLL_TRACE=WARN
```

With `UCC_COLL_TRACE=INFO`, UCC reports for every collective which CL and TL have been selected:

```
$ UCC_COLL_TRACE=INFO srun ./c/mpi/collective/osu_allreduce -i 1 -x 0 -d cuda -m 1048576:1048576

# OSU MPI-CUDA Allreduce Latency Test v7.0
# Size Avg Latency(us)
[1678205653.808236] [node_name:903 :0] ucc_coll.c:255 UCC_COLL INFO coll_init: Barrier; CL_BASIC {TL_UCP}, team_id 32768
[1678205653.809882] [node_name:903 :0] ucc_coll.c:255 UCC_COLL INFO coll_init: Allreduce sum: src={0x7fc1f3a03800, 262144, float32, Cuda}, dst={0x7fc195800000, 262144, float32, Cuda}; CL_BASIC {TL_NCCL}, team_id 32768
[1678205653.810344] [node_name:903 :0] ucc_coll.c:255 UCC_COLL INFO coll_init: Barrier; CL_BASIC {TL_UCP}, team_id 32768
[1678205653.810582] [node_name:903 :0] ucc_coll.c:255 UCC_COLL INFO coll_init: Reduce min root 0: src={0x7ffef34d5898, 1, float64, Host}, dst={0x7ffef34d58b0, 1, float64, Host}; CL_BASIC {TL_UCP}, team_id 32768
[1678205653.810641] [node_name:903 :0] ucc_coll.c:255 UCC_COLL INFO coll_init: Reduce max root 0: src={0x7ffef34d5898, 1, float64, Host}, dst={0x7ffef34d58a8, 1, float64, Host}; CL_BASIC {TL_UCP}, team_id 32768
[1678205653.810651] [node_name:903 :0] ucc_coll.c:255 UCC_COLL INFO coll_init: Reduce sum root 0: src={0x7ffef34d5898, 1, float64, Host}, dst={0x7ffef34d58a0, 1, float64, Host}; CL_BASIC {TL_UCP}, team_id 32768
1048576 582.07
[1678205653.810705] [node_name:903 :0] ucc_coll.c:255 UCC_COLL INFO coll_init: Barrier; CL_BASIC {TL_UCP}, team_id 32768
```

## Known Issues

- For the CUDA and NCCL TLs, CUDA device-dependent data structures are created when UCC
is initialized, which usually happens during `MPI_Init`. For these TLs it is therefore
important that the GPU used by an MPI rank does not change after `MPI_Init` is called;
see the sketch after this list for one way to pin the GPU before launch.
- UCC does not support CUDA managed memory for all TLs and collectives.
- Logging of collective tasks as described above, using NCCL as an example, is not unified
across TLs; e.g. some TLs do not log when a collective is started and finalized.
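
One common way to keep the GPU fixed, shown here only as a rough sketch, is a wrapper script
that pins each rank to a single device before the application (and thus `MPI_Init` and UCC
initialization) starts. It assumes one GPU per local rank and relies on
`OMPI_COMM_WORLD_LOCAL_RANK`, which Open MPI's `mpirun` sets; other launchers may need a
different variable:

```
#!/bin/bash
# bind_gpu.sh (hypothetical wrapper): expose exactly one GPU to each rank so the
# device selected at MPI_Init cannot change later; assumes one GPU per local rank
export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}
exec "$@"
```

It would be used as `mpirun -np 8 ./bind_gpu.sh ./my_mpi_app`, where `./my_mpi_app` is a
placeholder for the actual application.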

## Other useful information

- UCC FAQ: https://github.com/openucx/ucc/wiki/FAQ
- Output of `ucc_info -caf`