Description
$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity      NUMA Affinity   GPU NUMA ID
GPU0    X       NV18    NV18    NV18    NV18    NV18    NV18    NV18    0-47,96-143       0               N/A
GPU1    NV18    X       NV18    NV18    NV18    NV18    NV18    NV18    0-47,96-143       0               N/A
GPU2    NV18    NV18    X       NV18    NV18    NV18    NV18    NV18    0-47,96-143       0               N/A
GPU3    NV18    NV18    NV18    X       NV18    NV18    NV18    NV18    0-47,96-143       0               N/A
GPU4    NV18    NV18    NV18    NV18    X       NV18    NV18    NV18    48-95,144-191     1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18    X       NV18    NV18    48-95,144-191     1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18    X       NV18    48-95,144-191     1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18    X       48-95,144-191     1               N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
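For anyone triaging this, the matrix above can be checked mechanically to confirm that every GPU pair reports an NVLink connection. This is only a minimal sketch: `parse_topo`, `non_nvlink_pairs`, and `SAMPLE` are hypothetical names, the parser assumes the whitespace-separated layout shown above, and it says nothing about what actually fails at vmm.cuh:152.

```python
# Rows copied from the `nvidia-smi topo -m` output in this report.
SAMPLE = """
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 0-47,96-143 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 0-47,96-143 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 0-47,96-143 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 0-47,96-143 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 48-95,144-191 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 48-95,144-191 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 48-95,144-191 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X 48-95,144-191 1 N/A
"""


def parse_topo(text):
    """Return {(row_gpu, col_gpu): link} from the GPU rows of a topo matrix."""
    rows = [line.split() for line in text.strip().splitlines()]
    # Column names are the header tokens of the form GPU<n>; this skips the
    # bare "GPU" token that belongs to the "GPU NUMA ID" column title.
    gpus = [tok for tok in rows[0] if tok.startswith("GPU") and tok[3:].isdigit()]
    links = {}
    for row in rows[1:]:
        if not row or row[0] not in gpus:
            continue  # ignore legend lines or non-GPU devices such as NICs
        for col, link in zip(gpus, row[1:1 + len(gpus)]):
            links[(row[0], col)] = link
    return links


def non_nvlink_pairs(links):
    """List (gpu_a, gpu_b, link) for distinct GPUs whose link is not NV#."""
    return sorted(
        (a, b, link)
        for (a, b), link in links.items()
        if a != b and not link.startswith("NV")
    )


links = parse_topo(SAMPLE)
print(non_nvlink_pairs(links))  # prints [] for the matrix above
```

An empty list means every off-diagonal entry is an NV# link, which matches the all-NV18 matrix reported here.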
I then built and ran the all_reduce benchmark, and every process failed with the same CUDA error:
~/ThunderKittens/kernels/parallel/all_reduce$ make && make run
OMP_NUM_THREADS=1 torchrun --nproc_per_node=8 benchmark.py
Failed: CUDA error /home/ubuntu/ziming/ThunderKittens/include/types/system/vmm.cuh:152 'invalid argument'
Failed: CUDA error /home/ubuntu/ziming/ThunderKittens/include/types/system/vmm.cuh:152 'invalid argument'
Failed: CUDA error /home/ubuntu/ziming/ThunderKittens/include/types/system/vmm.cuh:152 'invalid argument'
Failed: CUDA error /home/ubuntu/ziming/ThunderKittens/include/types/system/vmm.cuh:152 'invalid argument'
Failed: CUDA error /home/ubuntu/ziming/ThunderKittens/include/types/system/vmm.cuh:152 'invalid argument'
Failed: CUDA error /home/ubuntu/ziming/ThunderKittens/include/types/system/vmm.cuh:152 'invalid argument'
Failed: CUDA error /home/ubuntu/ziming/ThunderKittens/include/types/system/vmm.cuh:152 'invalid argument'
Failed: CUDA error /home/ubuntu/ziming/ThunderKittens/include/types/system/vmm.cuh:152 'invalid argument'