Counting core dumped when running the benchmark #15

Open
elliothe opened this issue Jan 6, 2024 · 2 comments

elliothe commented Jan 6, 2024

./nvbandwidth
nvbandwidth Version: v0.2
Built from Git version: 6cefdda

NOTE: This tool reports current measured bandwidth on your system.
Additional system-specific tuning may be required to achieve maximal peak bandwidth.

CUDA Runtime Version: 12030
CUDA Driver Version: 12030
Driver Version: 545.29.06

Device 0: NVIDIA GeForce RTX 4090
Device 1: NVIDIA GeForce RTX 4090
Device 2: NVIDIA GeForce RTX 4090
Device 3: NVIDIA GeForce RTX 4090
Device 4: NVIDIA GeForce RTX 4090
Device 5: NVIDIA GeForce RTX 4090
Device 6: NVIDIA GeForce RTX 4090
Device 7: NVIDIA GeForce RTX 4090

Running host_to_device_memcpy_ce.
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
          0        1        2        3        4        5        6        7
 0    25.50    24.26    23.90    26.30    26.28    26.38    26.09    26.31

SUM host_to_device_memcpy_ce 205.04

Running device_to_host_memcpy_ce.
memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
          0        1        2        3        4        5        6        7
 0    26.77    26.47    26.25    26.63    27.10    27.11    27.11    26.83

SUM device_to_host_memcpy_ce 214.28

Running host_to_device_bidirectional_memcpy_ce.
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
          0        1        2        3        4        5        6        7
 0    11.24    14.42    16.15     9.78    18.82    16.46    13.65    19.24

SUM host_to_device_bidirectional_memcpy_ce 119.76

Running device_to_host_bidirectional_memcpy_ce.
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
          0        1        2        3        4        5        6        7
 0    22.35    21.16    21.04    24.01    21.34    21.40    21.41    21.53

SUM device_to_host_bidirectional_memcpy_ce 174.26

Running device_to_device_memcpy_read_ce.
Invalid value when checking the pattern at <0x7fcd8c000000>
Current offset [ 0/67108864]
Aborted (core dumped)

jodelek commented Jan 6, 2024

What OS do you use? Do you have IOMMU enabled?
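A quick way to check, as a sketch for a typical Linux setup (paths and messages vary by distro and kernel):

ls /sys/kernel/iommu_groups/            # non-empty output usually means an IOMMU is active
sudo dmesg | grep -i -e dmar -e iommu   # look for IOMMU/DMAR initialization messages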

denlogv commented Feb 7, 2024

Nvidia's consumer cards (e.g. Ada Lovelace) don't support peer-to-peer communication, and they also dropped NVLink. So I don't think you will be able to run this test; it fails in device_to_device_memcpy..., as you can see. Nvidia has confirmed it, too:
https://www.tomshardware.com/news/nvidia-confirms-geforce-cards-lack-p2p-support
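You can see what the driver reports for peer-to-peer without running nvbandwidth at all (a sketch; anything other than OK in the capability matrix means P2P is unavailable for that pair):

nvidia-smi topo -m        # link topology between GPUs (consumer boards show only PCIe paths, no NV# links)
nvidia-smi topo -p2p r    # per-pair P2P read capability (see nvidia-smi topo -h for the exact flag syntax)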

If you want to use them in parallel, you must set the environment variable NCCL_P2P_DISABLE=1 (it can also be set system-wide in /etc/nccl.conf); otherwise it will hang. You can read more about it in this discussion:
https://discuss.pytorch.org/t/ddp-training-on-rtx-4090-ada-cu118/168366/6
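As a minimal sketch (either the per-process export or the system-wide file is enough; check the NCCL docs for the exact config-file format your version expects):

export NCCL_P2P_DISABLE=1                                  # per-process: force NCCL to avoid P2P copies
echo 'NCCL_P2P_DISABLE=1' | sudo tee -a /etc/nccl.conf     # system-wide alternative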

It will cause a massive slowdown, though, as all data will be routed through CPU RAM, and it also doesn't seem to have any effect in nvbandwidth. All in all, investing in a workstation with more than one RTX 4090 is a terrible mistake, which I also learned the hard way.
