Counting core dumped when running the benchmark #15

Open
elliothe opened this issue Jan 6, 2024 · 2 comments

elliothe commented Jan 6, 2024

./nvbandwidth
nvbandwidth Version: v0.2
Built from Git version: 6cefdda

NOTE: This tool reports current measured bandwidth on your system.
Additional system-specific tuning may be required to achieve maximal peak bandwidth.

CUDA Runtime Version: 12030
CUDA Driver Version: 12030
Driver Version: 545.29.06

Device 0: NVIDIA GeForce RTX 4090
Device 1: NVIDIA GeForce RTX 4090
Device 2: NVIDIA GeForce RTX 4090
Device 3: NVIDIA GeForce RTX 4090
Device 4: NVIDIA GeForce RTX 4090
Device 5: NVIDIA GeForce RTX 4090
Device 6: NVIDIA GeForce RTX 4090
Device 7: NVIDIA GeForce RTX 4090

Running host_to_device_memcpy_ce.
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
          0        1        2        3        4        5        6        7
 0    25.50    24.26    23.90    26.30    26.28    26.38    26.09    26.31

SUM host_to_device_memcpy_ce 205.04

Running device_to_host_memcpy_ce.
memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
          0        1        2        3        4        5        6        7
 0    26.77    26.47    26.25    26.63    27.10    27.11    27.11    26.83

SUM device_to_host_memcpy_ce 214.28

Running host_to_device_bidirectional_memcpy_ce.
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
          0        1        2        3        4        5        6        7
 0    11.24    14.42    16.15     9.78    18.82    16.46    13.65    19.24

SUM host_to_device_bidirectional_memcpy_ce 119.76

Running device_to_host_bidirectional_memcpy_ce.
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
          0        1        2        3        4        5        6        7
 0    22.35    21.16    21.04    24.01    21.34    21.40    21.41    21.53

SUM device_to_host_bidirectional_memcpy_ce 174.26

Running device_to_device_memcpy_read_ce.
Invalid value when checking the pattern at <0x7fcd8c000000>
Current offset [ 0/67108864]
Aborted (core dumped)

jodelek commented Jan 6, 2024

What OS do you use? Do you have IOMMU enabled?
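A quick way to check, as a sketch for a typical Linux setup (paths and messages vary by distro and kernel):

ls /sys/kernel/iommu_groups/            # non-empty output usually means an IOMMU is active
sudo dmesg | grep -i -e dmar -e iommu   # look for IOMMU/DMAR initialization messages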

denlogv commented Feb 7, 2024

Nvidia's consumer cards (e.g. Ada Lovelace) don't support peer-to-peer communication, and they also dropped NVLink. So I don't think you will be able to run this test; it fails in device_to_device_memcpy..., as you can see. Nvidia has confirmed it, too:
https://www.tomshardware.com/news/nvidia-confirms-geforce-cards-lack-p2p-support
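You can see what the driver reports for peer-to-peer without running nvbandwidth at all (a sketch; anything other than OK in the capability matrix means P2P is unavailable for that pair):

nvidia-smi topo -m        # link topology between GPUs (consumer boards show only PCIe paths, no NV# links)
nvidia-smi topo -p2p r    # per-pair P2P read capability (see nvidia-smi topo -h for the exact flag syntax)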

If you want to use them in parallel, you must set the environment variable NCCL_P2P_DISABLE=1 (it can also be set system-wide in /etc/nccl.conf); otherwise it will hang. You can read more about it in this discussion:
https://discuss.pytorch.org/t/ddp-training-on-rtx-4090-ada-cu118/168366/6
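As a minimal sketch (either the per-process export or the system-wide file is enough; check the NCCL docs for the exact config-file format your version expects):

export NCCL_P2P_DISABLE=1                                  # per-process: force NCCL to avoid P2P copies
echo 'NCCL_P2P_DISABLE=1' | sudo tee -a /etc/nccl.conf     # system-wide alternative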

It will cause a massive slowdown, though, as all data will be routed through CPU RAM, and it also doesn't seem to have any effect in nvbandwidth. All in all, investing in a workstation with more than one RTX 4090 is a terrible mistake, which I also learned the hard way.
