# UCC User Guide

This guide describes how to leverage UCC to accelerate collectives within a parallel
programming model, e.g. MPI or OpenSHMEM, contingent on support from the particular
implementation of the programming model. For simplicity, this guide uses Open MPI as
one such example implementation. However, the described concepts are sufficiently
general, so they should transfer to other MPI implementations or programming models
as well.

Note that this is not a guide on how to use the UCC API or how to contribute to UCC;
for that, see the UCC API documentation available [here](https://openucx.github.io/ucc/)
and consider the technical and legal guidelines in the [contributing](../CONTRIBUTING.md)
file.

## Getting Started

Build Open MPI with UCC as described [here](../README.md#open-mpi-and-ucc-collectives).
To check if your Open MPI build supports UCC accelerated collectives, you can look for
the MCA coll `ucc` component:

```
$ ompi_info | grep ucc
  MCA coll: ucc (MCA v2.1.0, API v2.0.0, Component v4.1.4)
```

To execute your MPI program with UCC accelerated collectives, the `ucc` MCA component
needs to be enabled:

```
export OMPI_MCA_coll_ucc_enable=1
```

Currently, it is also required to set

```
export OMPI_MCA_coll_ucc_priority=100
```

to work around https://github.com/open-mpi/ompi/issues/9885.
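
Putting these settings together, a typical launch could look like the following sketch.
The process count and application name are placeholders; the same parameters can also be
passed via Open MPI's `--mca` option instead of environment variables:

```
# Enable the UCC coll component and raise its priority (workaround above),
# then launch the application as usual:
export OMPI_MCA_coll_ucc_enable=1
export OMPI_MCA_coll_ucc_priority=100
mpirun -np 4 ./my_mpi_app

# Equivalent, using mpirun's --mca option:
mpirun --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 -np 4 ./my_mpi_app
```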

In most situations, this is all that is needed to leverage UCC accelerated collectives
from your MPI program. UCC heuristics aim to always select the highest performing
implementation for a given collective, and UCC aims to support execution at all scales,
from a single node to a full supercomputer.

However, because there are many different system setups, collectives, and message sizes,
these heuristics can't be perfect in all cases. The remainder of this User Guide therefore
describes the parts of UCC which are necessary for basic UCC tuning. If manual tuning is
necessary, an issue report is appreciated at
[the Github tracker](https://github.com/openucx/ucc/issues) so that this can be considered
for future tuning of UCC heuristics.

Also, an MPI or other programming model implementation might need to execute a collective
that is not supported by UCC, e.g. because the datatype or reduction operator support is
lacking. In these cases the implementation can't call into UCC but needs to fall back to
another collective implementation. See [Logging](#logging) for an example of how to detect
this in the case of Open MPI. If UCC support is missing, an issue report describing the use
case is appreciated at [the Github tracker](https://github.com/openucx/ucc/issues) so that
this can be considered for future UCC development.

## CLs and TLs

UCC collective implementations are compositions of one or more **T**eam **L**ayers (TLs).
TLs are designed as thin composable abstraction layers with no dependencies between
different TLs. To fulfill semantic requirements of programming models like MPI, and because
not all TLs cover the full functionality required by a given collective (e.g. the SHARP TL
does not support intra-node collectives), TLs are composed by
**C**ollective **L**ayers (CLs). The list of CLs and TLs supported by the available UCC
installation can be queried with:

```
$ ucc_info -s
Default CLs scores: basic=10 hier=50
Default TLs scores: cuda=40 nccl=20 self=50 ucp=10
```

This UCC installation supports two CLs:
- `basic`: Basic CL available for all supported algorithms and good for most use cases.
- `hier`: Hierarchical CL exploiting the hierarchy on a system, e.g. NVLINK within a node
  and SHARP for the network. The `hier` CL exposes two hierarchy levels: `NODE` containing
  all ranks running on the same node and `NET` containing one rank from each node. In
  addition to that, there is the `FULL` subgroup with all ranks. A concrete example of a
  hierarchical CL is a pipeline of shared memory UCP reduce with inter-node SHARP and UCP
  broadcast. The `basic` CL can leverage the same TLs but would execute in a non-pipelined,
  less efficient fashion.

and four TLs:
- `cuda`: TL supporting CUDA device memory exploiting NVLINK connections between GPUs.
- `nccl`: TL leveraging [NCCL](https://github.com/NVIDIA/nccl) for collectives on CUDA
  device memory. In many cases, UCC collectives are directly mapped to NCCL collectives.
  If that is not possible, a combination of NCCL collectives might be used.
- `self`: TL to support collectives with only one participant.
- `ucp`: TL building on UCP point-to-point communication routines from
  [UCX](https://github.com/openucx/ucx). This is the most general TL which supports all
  memory types. Any required computation happens local to the memory; e.g., for CUDA
  device memory, CUDA kernels are used for computation.

In addition to those TLs supported by the example Open MPI implementation used in this guide,
UCC also supports the following TLs:
- `sharp`: TL leveraging the
  [NVIDIA **S**calable **H**ierarchical **A**ggregation and **R**eduction **P**rotocol (SHARP)™](https://docs.nvidia.com/networking/category/mlnxsharp)
  in-network computing features to accelerate inter-node collectives.
- `rccl`: TL leveraging [RCCL](https://github.com/ROCmSoftwarePlatform/rccl) for collectives
  on ROCm device memory.

UCC is extensible, so vendors can provide additional TLs. For example, the UCC binaries shipped
with [HPC-X](https://developer.nvidia.com/networking/hpc-x) add the `shm` TL with optimized
CPU shared memory collectives.

UCC exposes environment variables to tune CL and TL selection and behavior. The list of all
environment variables with a description is available from `ucc_info`:

```
$ ucc_info -caf | head -15
# UCX library configuration file
# Uncomment to modify values

#
# UCC configuration
#

#
# Comma separated list of CL components to be used
#
# syntax: comma-separated list of: [basic|hier|all]
#
UCC_CLS=basic
```
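
For example, the CLs and TLs that UCC is allowed to use can be restricted up front. The
sketch below assumes the `UCC_TLS` variable, which selects TL components analogously to the
`UCC_CLS` variable shown above; the chosen values are purely illustrative:

```
# Only consider the hierarchical CL ...
export UCC_CLS=hier
# ... and only the UCP and NCCL TLs (plus self for single-rank teams):
export UCC_TLS=ucp,nccl,self
```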

In this guide, we will focus on how TLs are selected based on a score. Every time UCC needs
to select a TL, the TL with the highest score is selected, considering:

- The collective type
- The message size
- The memory type
- The team size (number of ranks participating in the collective)

A user can set the `UCC_TL_<NAME>_TUNE` environment variables to override the default scores
following this syntax:

```
UCC_TL_<NAME>_TUNE=token1#token2#...#tokenN
```

This passes a `#`-separated list of tokens to the environment variable. Each token is a `:`
separated list of qualifiers:

```
token=coll_type:msg_range:mem_type:team_size:score:alg
```

Each qualifier is optional; the only requirement is that either `score` or `alg` is
provided. The qualifiers are (a combined example follows the memory type list below):

- `coll_type = coll_type_1,coll_type_2,...,coll_type_n` - a `,` separated list of
  collective types.
- `msg_range = m_start_1-m_end_1,m_start_2-m_end_2,..,m_start_n-m_end_n` - a `,`
  separated list of message ranges in bytes, where each range is represented by `start`
  and `end` values separated by `-`. Values can be integers using optional binary
  prefixes. Supported prefixes are `K=1<<10`, `M=1<<20`, `G=1<<30`, and `T=1<<40`.
  Parsing is case independent and a `b` can be optionally added. The special value
  `inf` means MAX msg size. E.g. `128`, `256b`, `4K`, `1M` are valid sizes.
- `mem_type = m1,m2,..,mN` - a `,` separated list of memory types.
- `team_size = [t_start_1-t_end_1,t_start_2-t_end_2,...,t_start_N-t_end_N]` - a
  `,` separated list of team size ranges enclosed with `[]`.
- `score` - an `int` value from `0` to `inf`.
- `alg = @<value|str>` - character `@` followed by either the `int` or string
  representing the collective algorithm.

Supported memory types are:
- `cpu`: for CPU memory.
- `cuda`: for pinned CUDA Device memory (`cudaMalloc`).
- `cuda_managed`: for CUDA Managed Memory (`cudaMallocManaged`).
- `rocm`: for pinned ROCm Device memory.

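Putting these pieces together, tuning could look like the following sketch. The first
setting mirrors the `UCC_TL_NCCL_TUNE` example from the configuration file paragraph
further below; the second is a hypothetical variation that additionally restricts the
boost to a message range:

```
# Give the NCCL TL the maximum score for allreduce on CUDA buffers and
# disable it for alltoall (score 0):
export UCC_TL_NCCL_TUNE=allreduce:cuda:inf#alltoall:0

# Hypothetical: boost the UCP TL only for small (0-4K bytes) allreduces
# on CUDA buffers:
export UCC_TL_UCP_TUNE=allreduce:0-4K:cuda:inf
```
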
The supported collective types and algorithms can be queried with:

```
$ ucc_info -A
cl/hier algorithms:
  Allreduce
    0 : rab : intra-node reduce, followed by inter-node allreduce, followed by innode broadcast
    1 : split_rail : intra-node reduce_scatter, followed by PPN concurrent inter-node allreduces, followed by intra-node allgather
  Alltoall
    0 : node_split : splitting alltoall into two concurrent a2av calls withing the node and outside of it
  Alltoallv
    0 : node_split : splitting alltoallv into two concurrent a2av calls withing the node and outside of it
[...] snip
```
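
The `alg` qualifier described above accepts the algorithm name or numeric ID from this
listing. As a sketch, assuming CLs are tuned through the analogous `UCC_CL_<NAME>_TUNE`
variable (not shown in this guide), forcing the hierarchical allreduce algorithm could
look like:

```
# Force the "rab" allreduce algorithm of the hierarchical CL
# (equivalently "@0", its numeric ID in the listing above):
export UCC_CL_HIER_TUNE=allreduce:@rab
```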

See the [FAQ](https://github.com/openucx/ucc/wiki/FAQ#6-what-is-tl-scoring-and-how-to-select-a-certain-tl)
in the [UCC Wiki](https://github.com/openucx/ucc/wiki) for more information and concrete examples.
If multiple TLs have the same highest score for a given combination, it is implementation-defined
which of them is selected.

Tuning UCC heuristics is also possible with the UCC configuration file (`ucc.conf`). This file
provides a unified way of tailoring the behavior of UCC components: CLs, TLs, and ECs. It can
contain any UCC variables of the format `VAR = VALUE`, e.g.
`UCC_TL_NCCL_TUNE=allreduce:cuda:inf#alltoall:0` to force NCCL allreduce for "cuda" buffers and
disable NCCL for alltoall. See [`contrib/ucc.conf`](../contrib/ucc.conf) for an example and the
[FAQ](https://github.com/openucx/ucc/wiki/FAQ#13-ucc-configuration-file-and-priority) for
further details.
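
A minimal `ucc.conf` might therefore look like the sketch below; the entries are illustrative
rather than recommended settings:

```
# Restrict UCC to the basic CL
UCC_CLS=basic
# Force NCCL allreduce for CUDA buffers, disable NCCL for alltoall
UCC_TL_NCCL_TUNE=allreduce:cuda:inf#alltoall:0
```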

## Logging

To detect if Open MPI leverages UCC for a given collective, one can set
`OMPI_MCA_coll_ucc_verbose=3` and check for output like

```
coll_ucc_alltoall.c:70 - mca_coll_ucc_alltoall() running ucc alltoall
```

For example, Open MPI leverages UCC for `MPI_Alltoall`, used by `osu_alltoall` from the
[OSU Microbenchmarks](https://mvapich.cse.ohio-state.edu/benchmarks/), as can be seen from
the log message

```
coll_ucc_alltoall.c:70 - mca_coll_ucc_alltoall() running ucc alltoall
```

in the output of

```
$ OMPI_MCA_coll_ucc_verbose=3 srun ./c/mpi/collective/osu_alltoall -i 1 -x 0 -d cuda -m 1048576:1048576

[...] snip

# OSU MPI-CUDA All-to-All Personalized Exchange Latency Test v7.0
# Size Avg Latency(us)
 coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
 coll_ucc_alltoall.c:70 - mca_coll_ucc_alltoall() running ucc alltoall
 coll_ucc_alltoall.c:70 - mca_coll_ucc_alltoall() running ucc alltoall
 coll_ucc_alltoall.c:70 - mca_coll_ucc_alltoall() running ucc alltoall
 coll_ucc_alltoall.c:70 - mca_coll_ucc_alltoall() running ucc alltoall
 coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
 coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
 coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
 coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
 coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
 coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
 coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
 coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
 coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
 coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
 coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
 coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
 coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
 coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
 coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
 coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
 coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
 coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
 coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
1048576 2586333.78

[...] snip

```

For `MPI_Alltoallw`, Open MPI can't leverage UCC, so the output of `osu_alltoallw` looks
different: it only contains the log messages for the barriers needed for correct timing and
the reduces needed to calculate timing statistics:

```
$ OMPI_MCA_coll_ucc_verbose=3 srun ./c/mpi/collective/osu_alltoallw -i 1 -x 0 -d cuda -m 1048576:1048576

[...] snip

# OSU MPI-CUDA All-to-Allw Personalized Exchange Latency Test v7.0
# Size Avg Latency(us)
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_barrier.c:31 - mca_coll_ucc_barrier() running ucc barrier
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
coll_ucc_reduce.c:70 - mca_coll_ucc_reduce() running ucc reduce
1048576 11434.97

[...] snip

```

To debug the choices made by UCC heuristics, setting `UCC_LOG_LEVEL=INFO` provides valuable
information. E.g., it prints the score map with all supported collectives, TLs, and memory
types:

```
[...] snip
 ucc_team.c:452 UCC INFO ===== COLL_SCORE_MAP (team_id 32768) =====
ucc_coll_score_map.c:185 UCC INFO Allgather:
ucc_coll_score_map.c:185 UCC INFO Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:185 UCC INFO Cuda: {0..inf}:TL_NCCL:10
ucc_coll_score_map.c:185 UCC INFO CudaManaged: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:185 UCC INFO Allgatherv:
ucc_coll_score_map.c:185 UCC INFO Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:185 UCC INFO Cuda: {0..16383}:TL_NCCL:10 {16K..1048575}:TL_NCCL:10 {1M..inf}:TL_NCCL:10
ucc_coll_score_map.c:185 UCC INFO CudaManaged: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:185 UCC INFO Allreduce:
ucc_coll_score_map.c:185 UCC INFO Host: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
ucc_coll_score_map.c:185 UCC INFO Cuda: {0..4095}:TL_NCCL:10 {4K..inf}:TL_NCCL:10
ucc_coll_score_map.c:185 UCC INFO CudaManaged: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
ucc_coll_score_map.c:185 UCC INFO Rocm: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
ucc_coll_score_map.c:185 UCC INFO RocmManaged: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
[...] snip
```
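
A sketch of how this output could be captured for an Open MPI program, reusing the settings
from the Getting Started section (launcher, rank count, and application name are placeholders):

```
export OMPI_MCA_coll_ucc_enable=1
export OMPI_MCA_coll_ucc_priority=100
export UCC_LOG_LEVEL=INFO
# -x forwards the UCC variable to all ranks:
mpirun -x UCC_LOG_LEVEL -np 2 ./my_mpi_app
```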

UCC 1.2.0 or newer supports the `UCC_COLL_TRACE` environment variable:

```
$ ucc_info -caf | grep -B6 UCC_COLL_TRACE
#
# UCC collective logging level. Higher level will result in more verbose collective info.
# Possible values are: fatal, error, warn, info, debug, trace, data, func, poll.
#
# syntax: [FATAL|ERROR|WARN|DIAG|INFO|DEBUG|TRACE|REQ|DATA|ASYNC|FUNC|POLL]
#
UCC_COLL_TRACE=WARN
```

With `UCC_COLL_TRACE=INFO`, UCC reports for every collective which CL and TL have been selected:

```
$ UCC_COLL_TRACE=INFO srun ./c/mpi/collective/osu_allreduce -i 1 -x 0 -d cuda -m 1048576:1048576

# OSU MPI-CUDA Allreduce Latency Test v7.0
# Size Avg Latency(us)
[1678205653.808236] [node_name:903 :0] ucc_coll.c:255 UCC_COLL INFO coll_init: Barrier; CL_BASIC {TL_UCP}, team_id 32768
[1678205653.809882] [node_name:903 :0] ucc_coll.c:255 UCC_COLL INFO coll_init: Allreduce sum: src={0x7fc1f3a03800, 262144, float32, Cuda}, dst={0x7fc195800000, 262144, float32, Cuda}; CL_BASIC {TL_NCCL}, team_id 32768
[1678205653.810344] [node_name:903 :0] ucc_coll.c:255 UCC_COLL INFO coll_init: Barrier; CL_BASIC {TL_UCP}, team_id 32768
[1678205653.810582] [node_name:903 :0] ucc_coll.c:255 UCC_COLL INFO coll_init: Reduce min root 0: src={0x7ffef34d5898, 1, float64, Host}, dst={0x7ffef34d58b0, 1, float64, Host}; CL_BASIC {TL_UCP}, team_id 32768
[1678205653.810641] [node_name:903 :0] ucc_coll.c:255 UCC_COLL INFO coll_init: Reduce max root 0: src={0x7ffef34d5898, 1, float64, Host}, dst={0x7ffef34d58a8, 1, float64, Host}; CL_BASIC {TL_UCP}, team_id 32768
[1678205653.810651] [node_name:903 :0] ucc_coll.c:255 UCC_COLL INFO coll_init: Reduce sum root 0: src={0x7ffef34d5898, 1, float64, Host}, dst={0x7ffef34d58a0, 1, float64, Host}; CL_BASIC {TL_UCP}, team_id 32768
1048576 582.07
[1678205653.810705] [node_name:903 :0] ucc_coll.c:255 UCC_COLL INFO coll_init: Barrier; CL_BASIC {TL_UCP}, team_id 32768
```

## Known Issues

- For the CUDA and NCCL TLs, CUDA device-dependent data structures are created when UCC
  is initialized, which usually happens during `MPI_Init`. For these TLs it is therefore
  important that the GPU used by an MPI rank does not change after `MPI_Init` is called.
- UCC does not support CUDA managed memory for all TLs and collectives.
- Logging of collective tasks, as described above using NCCL as an example, is not unified.
  E.g., some TLs do not log when a collective is started and finalized.

## Other useful information

- UCC FAQ: https://github.com/openucx/ucc/wiki/FAQ
- Output of `ucc_info -caf`