Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try to avoid UB in thrust::reference #2813

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

miscco
Copy link
Collaborator

@miscco miscco commented Nov 14, 2024

We were dereferencing a nullptr because we wanted to avoid default constructing a system.

This is obvious UB and prevents us from running the tests with sanitizers, so just default construct the systems.

Addresses #1645

@miscco miscco requested review from a team as code owners November 14, 2024 13:34
@miscco miscco requested review from elstehle and griwes November 14, 2024 13:34
@miscco
Copy link
Collaborator Author

miscco commented Nov 14, 2024

@Artem-B can you check whether that fixes some of the UB for you?

We were dereferencing a nullptr because we wanted to avoid default constructing a system.

This is obvious UB and prevents us from running the tests with sanitizers, so just default construct the systems.

Addresses NVIDIA#1645
@miscco miscco force-pushed the avoid_ub_thrust_reference branch from ffe0ddb to 9763535 Compare November 14, 2024 13:44
@miscco
Copy link
Collaborator Author

miscco commented Nov 14, 2024

@Artem-B

Is that something on your side?

ptxas fatal : Unresolved extern function '_ZN6thrust35THRUST_200800___CUDA_ARCH_LIST___NS2mr19get_global_resourceINS0_26device_ptr_memory_resourceINS0_6system4cuda6detail20cuda_memory_resourceIXadL_ZN3c2h6detail19checked_cuda_mallocEPPvmEEXadL_ZL8cudaFreeSA_EENS0_7pointerIvNS0_8cuda_cub3tagENS0_16tagged_referenceIvSE_EENS0_11use_defaultEEEEEEEEEPT_v'
clang++: error: ptxas command failed with exit code 255 (use -v to see invocation)
Ubuntu clang version 18.1.8 (++20240731024944+3b5b5c1ec4a3-1exp120240731145000.144)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
clang++: note: diagnostic msg:


@Artem-B
Copy link
Contributor

Artem-B commented Nov 14, 2024

This could be ptxas bug (see NVIDIA's bug 4924626). It will be supposedly fixed in cuda-12.8, but I do not have the details what exactly triggers it -- there was nothing obvious in the minimized reproducer, but it did seem to be related to setmaxnreg.inc.sync.aligned. Can you check if it's present in the PTX produced by clang in this case? If it has the instruction, it's probably the same issue.

@Artem-B
Copy link
Contributor

Artem-B commented Nov 14, 2024

The patch unbreaks ~500 out of 680 tests that were failing in ubsan-enabled builds. I do not run into the ptxas crash, but I'm using a very recent clang build, so it likely generates different PTX than your builds.

Remaining failing tests report ubsan failures in these places:

third_party/gpus/cccl/v2_6_0/thrust/thrust/iterator/iterator_adaptor.h:197:12: runtime error: reference binding to null pointer of type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior third_party/gpus/cccl/v2_6_0/thrust/thrust/iterator/iterator_adaptor.h:197:12

A handful of tests failed to compile PTX with missing symbol:

ptxas fatal   : Unresolved extern function '_ZN6thrust35THRUST_200601___CUDA_ARCH_LIST___NS2mr19get_global_resourceINS0_26device_ptr_memory_resourceINS0_6system4cuda6detail20cuda_memory_resourceIXadL_ZN3c2h6detail19checked_cuda_mallocEPPvmEEXadL_ZL8cudaFreeSA_EENS0_7pointerIvNS0_8cuda_cub3tagENS0_16tagged_referenceIvSE_EENS0_11use_defaultEEEEEEEEEPT_v'
clang: error: ptxas command failed with exit code 255 (use -v to see invocation)

Reproducible on catch2_test_device_merge_sort, and catch2_test_device_three_way_partition.

This is the same issue miscco mentioned above. I'll take a closer look tomorrow.

// for dispatch. This assumes that `get_value` will not access any system
// state.
typename thrust::iterator_system<pointer>::type* system = nullptr;
typename thrust::iterator_system<pointer>::type system;
Copy link
Collaborator

@jrhemstad jrhemstad Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just pass the type explicitly instead of materializing a value to pass? That way we can avoid both needing to default construct or using a nullptr.

Suggested change
typename thrust::iterator_system<pointer>::type system;
typename thrust::iterator_system<pointer>::type system;
return convert_to_value_type<decltype(system)>();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly would also prefer if we did not need to construct systems. From the recent CI failures and the generally open design (users can bring their own systems) we have to assume that users could rely on Thrust not trying to default construct a system. Given the current code "worked" and no data of the system was actually accessed, we can assume that only the type information should be needed. I assume/hope there is a better solution.

Copy link
Contributor

@bernhardmgruber bernhardmgruber Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered how thrust::device_vector produces a system when it inits the elements during construction: It tries to get the system from the allocator. But in case the allocator has no system, it default constructs a system (thrust knows the type of the system for an allocator):

template <typename Alloc>
_CCCL_HOST_DEVICE ::cuda::std::enable_if_t<has_member_system<Alloc>::value, typename allocator_system<Alloc>::type&>
system(Alloc& a)
{
// return the allocator's system
return a.system();
}
template <typename Alloc>
_CCCL_HOST_DEVICE ::cuda::std::enable_if_t<!has_member_system<Alloc>::value, typename allocator_system<Alloc>::type>
system(Alloc&)
{
// return a copy of a value-initialized system
return typename allocator_system<Alloc>::type();
}

The default allocator for a thrust::device_vector does not seem to contain a system. So it seems we have precedence for default constructing systems when needed.

Copy link
Contributor

🟨 CI finished in 6h 03m: Pass: 97%/222 | Total: 7d 12h | Avg: 48m 40s | Max: 6h 01m | Hits: 44%/13508
  • 🟨 cub: Pass: 96%/110 | Total: 4d 01h | Avg: 53m 04s | Max: 1h 39m | Hits: 33%/2964

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  96%/102 | Total:  3d 17h | Avg: 52m 36s | Max:  1h 39m | Hits:  33%/2964  
      🟩 arm64              Pass: 100%/8   | Total:  7h 52m | Avg: 59m 00s | Max:  1h 05m
    🔍 ctk: 12.6 🔍
      🟩 11.1               Pass: 100%/15  | Total: 12h 10m | Avg: 48m 40s | Max: 57m 26s | Hits:  33%/741   
      🟩 11.8               Pass: 100%/3   | Total:  4h 04m | Avg:  1h 21m | Max:  1h 39m
      🟩 12.5               Pass: 100%/4   | Total:  4h 24m | Avg:  1h 06m | Max:  1h 08m
      🔍 12.6               Pass:  95%/88  | Total:  3d 04h | Avg: 52m 15s | Max:  1h 16m | Hits:  33%/2223  
    🚨 cudacxx: ClangCUDA18 🚨
      🔥 ClangCUDA18        Pass:   0%/4   | Total:  3h 50m | Avg: 57m 42s | Max:  1h 03m
      🟩 nvcc11.1           Pass: 100%/15  | Total: 12h 10m | Avg: 48m 40s | Max: 57m 26s | Hits:  33%/741   
      🟩 nvcc11.8           Pass: 100%/3   | Total:  4h 04m | Avg:  1h 21m | Max:  1h 39m
      🟩 nvcc12.5           Pass: 100%/4   | Total:  4h 24m | Avg:  1h 06m | Max:  1h 08m
      🟩 nvcc12.6           Pass: 100%/84  | Total:  3d 00h | Avg: 51m 59s | Max:  1h 16m | Hits:  33%/2223  
    🚨 cudacxx_family: ClangCUDA 🚨
      🔥 ClangCUDA          Pass:   0%/4   | Total:  3h 50m | Avg: 57m 42s | Max:  1h 03m
      🟩 nvcc               Pass: 100%/106 | Total:  3d 21h | Avg: 52m 53s | Max:  1h 39m | Hits:  33%/2964  
    🔍 cxx: Clang18 🔍
      🟩 Clang9             Pass: 100%/6   | Total:  5h 30m | Avg: 55m 03s | Max:  1h 16m
      🟩 Clang10            Pass: 100%/3   | Total:  2h 51m | Avg: 57m 08s | Max: 58m 07s
      🟩 Clang11            Pass: 100%/4   | Total:  3h 51m | Avg: 57m 49s | Max:  1h 00m
      🟩 Clang12            Pass: 100%/4   | Total:  3h 40m | Avg: 55m 09s | Max:  1h 00m
      🟩 Clang13            Pass: 100%/4   | Total:  3h 35m | Avg: 53m 58s | Max: 56m 55s
      🟩 Clang14            Pass: 100%/4   | Total:  3h 45m | Avg: 56m 16s | Max: 58m 25s
      🟩 Clang15            Pass: 100%/4   | Total:  3h 39m | Avg: 54m 51s | Max: 56m 58s
      🟩 Clang16            Pass: 100%/4   | Total:  3h 40m | Avg: 55m 02s | Max:  1h 00m
      🟩 Clang17            Pass: 100%/4   | Total:  3h 31m | Avg: 52m 58s | Max: 54m 30s
      🔍 Clang18            Pass:  63%/11  | Total:  9h 28m | Avg: 51m 39s | Max:  1h 05m
      🟩 GCC6               Pass: 100%/2   | Total:  1h 30m | Avg: 45m 04s | Max: 45m 18s
      🟩 GCC7               Pass: 100%/6   | Total:  5h 12m | Avg: 52m 05s | Max: 58m 00s
      🟩 GCC8               Pass: 100%/6   | Total:  5h 06m | Avg: 51m 01s | Max: 56m 33s
      🟩 GCC9               Pass: 100%/6   | Total:  5h 15m | Avg: 52m 35s | Max: 57m 23s
      🟩 GCC10              Pass: 100%/4   | Total:  3h 49m | Avg: 57m 29s | Max: 59m 39s
      🟩 GCC11              Pass: 100%/7   | Total:  7h 55m | Avg:  1h 07m | Max:  1h 39m
      🟩 GCC12              Pass: 100%/4   | Total:  3h 52m | Avg: 58m 12s | Max:  1h 02m
      🟩 GCC13              Pass: 100%/16  | Total:  9h 41m | Avg: 36m 19s | Max:  1h 04m
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 58m | Avg: 59m 38s | Max:  1h 04m
      🟩 MSVC14.16          Pass: 100%/1   | Total: 57m 26s | Avg: 57m 26s | Max: 57m 26s | Hits:  33%/741   
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 57m | Avg: 58m 47s | Max: 59m 03s | Hits:  33%/1482  
      🟩 MSVC14.39          Pass: 100%/1   | Total:  1h 00m | Avg:  1h 00m | Max:  1h 00m | Hits:  33%/741   
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  4h 24m | Avg:  1h 06m | Max:  1h 08m
    🔍 cxx_family: Clang 🔍
      🔍 Clang              Pass:  91%/48  | Total:  1d 19h | Avg: 54m 27s | Max:  1h 16m
      🟩 GCC                Pass: 100%/51  | Total:  1d 18h | Avg: 49m 53s | Max:  1h 39m
      🟩 Intel              Pass: 100%/3   | Total:  2h 58m | Avg: 59m 38s | Max:  1h 04m
      🟩 MSVC               Pass: 100%/4   | Total:  3h 55m | Avg: 58m 57s | Max:  1h 00m | Hits:  33%/2964  
      🟩 NVHPC              Pass: 100%/4   | Total:  4h 24m | Avg:  1h 06m | Max:  1h 08m
    🔍 jobs: Build 🔍
      🔍 Build              Pass:  96%/102 | Total:  3d 22h | Avg: 55m 35s | Max:  1h 39m | Hits:  33%/2964  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 19m 20s | Avg: 19m 20s | Max: 19m 20s
      🟩 GraphCapture       Pass: 100%/1   | Total: 17m 27s | Avg: 17m 27s | Max: 17m 27s
      🟩 HostLaunch         Pass: 100%/3   | Total: 54m 28s | Avg: 18m 09s | Max: 18m 42s
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 15m | Avg: 25m 18s | Max: 26m 08s
    🟨 gpu
      🟨 v100               Pass:  96%/110 | Total:  4d 01h | Avg: 53m 04s | Max:  1h 39m | Hits:  33%/2964  
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  4h 04m | Avg:  1h 21m | Max:  1h 39m
      🟩 90a                Pass: 100%/4   | Total:  1h 47m | Avg: 26m 52s | Max: 33m 51s
    🟨 std
      🟨 11                 Pass:  96%/30  | Total:  1d 01h | Avg: 51m 53s | Max:  1h 11m
      🟨 14                 Pass:  96%/29  | Total:  1d 02h | Avg: 55m 47s | Max:  1h 39m | Hits:  33%/1482  
      🟨 17                 Pass:  96%/27  | Total:  1d 01h | Avg: 56m 19s | Max:  1h 16m | Hits:  33%/741   
      🟨 20                 Pass:  95%/24  | Total: 19h 02m | Avg: 47m 36s | Max:  1h 08m | Hits:  33%/741   
    
  • 🟨 thrust: Pass: 99%/109 | Total: 3d 10h | Avg: 45m 19s | Max: 6h 01m | Hits: 48%/10544

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  99%/101 | Total:  3d 05h | Avg: 45m 48s | Max:  6h 01m | Hits:  48%/10544 
      🟩 arm64              Pass: 100%/8   | Total:  5h 13m | Avg: 39m 14s | Max: 45m 51s
    🔍 ctk: 12.6 🔍
      🟩 11.1               Pass: 100%/15  | Total: 10h 20m | Avg: 41m 21s | Max:  1h 18m | Hits:  30%/2636  
      🟩 11.8               Pass: 100%/3   | Total:  2h 30m | Avg: 50m 17s | Max: 58m 17s
      🟩 12.5               Pass: 100%/4   | Total:  5h 27m | Avg:  1h 21m | Max:  1h 31m
      🔍 12.6               Pass:  98%/87  | Total:  2d 16h | Avg: 44m 10s | Max:  6h 01m | Hits:  53%/7908  
    🔍 cudacxx: nvcc12.6 🔍
      🟩 ClangCUDA18        Pass: 100%/4   | Total:  2h 12m | Avg: 33m 13s | Max: 35m 39s
      🟩 nvcc11.1           Pass: 100%/15  | Total: 10h 20m | Avg: 41m 21s | Max:  1h 18m | Hits:  30%/2636  
      🟩 nvcc11.8           Pass: 100%/3   | Total:  2h 30m | Avg: 50m 17s | Max: 58m 17s
      🟩 nvcc12.5           Pass: 100%/4   | Total:  5h 27m | Avg:  1h 21m | Max:  1h 31m
      🔍 nvcc12.6           Pass:  98%/83  | Total:  2d 13h | Avg: 44m 41s | Max:  6h 01m | Hits:  53%/7908  
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/4   | Total:  2h 12m | Avg: 33m 13s | Max: 35m 39s
      🔍 nvcc               Pass:  99%/105 | Total:  3d 08h | Avg: 45m 47s | Max:  6h 01m | Hits:  48%/10544 
    🔍 cxx: MSVC14.29 🔍
      🟩 Clang9             Pass: 100%/6   | Total:  3h 46m | Avg: 37m 45s | Max: 42m 22s
      🟩 Clang10            Pass: 100%/3   | Total:  2h 10m | Avg: 43m 39s | Max: 51m 24s
      🟩 Clang11            Pass: 100%/4   | Total:  2h 58m | Avg: 44m 35s | Max: 51m 52s
      🟩 Clang12            Pass: 100%/4   | Total:  2h 54m | Avg: 43m 43s | Max: 54m 43s
      🟩 Clang13            Pass: 100%/4   | Total:  2h 45m | Avg: 41m 23s | Max: 42m 59s
      🟩 Clang14            Pass: 100%/4   | Total:  2h 55m | Avg: 43m 47s | Max: 52m 44s
      🟩 Clang15            Pass: 100%/4   | Total:  2h 53m | Avg: 43m 21s | Max: 47m 27s
      🟩 Clang16            Pass: 100%/4   | Total:  2h 46m | Avg: 41m 30s | Max: 47m 49s
      🟩 Clang17            Pass: 100%/4   | Total:  2h 47m | Avg: 41m 46s | Max: 47m 32s
      🟩 Clang18            Pass: 100%/11  | Total:  5h 56m | Avg: 32m 22s | Max: 46m 46s
      🟩 GCC6               Pass: 100%/2   | Total:  1h 15m | Avg: 37m 53s | Max: 41m 57s
      🟩 GCC7               Pass: 100%/6   | Total:  4h 04m | Avg: 40m 46s | Max: 45m 30s
      🟩 GCC8               Pass: 100%/6   | Total:  4h 15m | Avg: 42m 32s | Max: 54m 06s
      🟩 GCC9               Pass: 100%/6   | Total:  4h 00m | Avg: 40m 05s | Max: 47m 10s
      🟩 GCC10              Pass: 100%/4   | Total:  2h 43m | Avg: 40m 48s | Max: 45m 02s
      🟩 GCC11              Pass: 100%/7   | Total:  5h 34m | Avg: 47m 51s | Max: 58m 17s
      🟩 GCC12              Pass: 100%/4   | Total:  3h 07m | Avg: 46m 45s | Max: 52m 08s
      🟩 GCC13              Pass: 100%/14  | Total:  6h 41m | Avg: 28m 40s | Max: 45m 51s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 53m | Avg: 57m 40s | Max:  1h 12m
      🟩 MSVC14.16          Pass: 100%/1   | Total:  1h 18m | Avg:  1h 18m | Max:  1h 18m | Hits:  30%/2636  
      🔍 MSVC14.29          Pass:  50%/2   | Total:  7h 18m | Avg:  3h 39m | Max:  6h 01m | Hits:  30%/2636  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 47m | Avg: 53m 34s | Max:  1h 17m | Hits:  65%/5272  
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  5h 27m | Avg:  1h 21m | Max:  1h 31m
    🔍 cxx_family: MSVC 🔍
      🟩 Clang              Pass: 100%/48  | Total:  1d 07h | Avg: 39m 52s | Max: 54m 43s
      🟩 GCC                Pass: 100%/49  | Total:  1d 07h | Avg: 38m 50s | Max: 58m 17s
      🟩 Intel              Pass: 100%/3   | Total:  2h 53m | Avg: 57m 40s | Max:  1h 12m
      🔍 MSVC               Pass:  80%/5   | Total: 10h 23m | Avg:  2h 04m | Max:  6h 01m | Hits:  48%/10544 
      🟩 NVHPC              Pass: 100%/4   | Total:  5h 27m | Avg:  1h 21m | Max:  1h 31m
    🔍 jobs: Build 🔍
      🔍 Build              Pass:  99%/102 | Total:  3d 08h | Avg: 47m 25s | Max:  6h 01m | Hits:  30%/7908  
      🟩 TestCPU            Pass: 100%/4   | Total: 55m 11s | Avg: 13m 47s | Max: 29m 19s | Hits:  99%/2636  
      🟩 TestGPU            Pass: 100%/3   | Total: 48m 37s | Avg: 16m 12s | Max: 17m 47s
    🔍 std: 17 🔍
      🟩 11                 Pass: 100%/30  | Total: 17h 24m | Avg: 34m 49s | Max:  1h 08m
      🟩 14                 Pass: 100%/29  | Total: 22h 47m | Avg: 47m 10s | Max:  1h 18m | Hits:  30%/5272  
      🔍 17                 Pass:  96%/27  | Total:  1d 02h | Avg: 58m 03s | Max:  6h 01m
      🟩 20                 Pass: 100%/23  | Total: 16h 00m | Avg: 41m 46s | Max:  1h 31m | Hits:  65%/5272  
    🟨 gpu
      🟨 v100               Pass:  99%/109 | Total:  3d 10h | Avg: 45m 19s | Max:  6h 01m | Hits:  48%/10544 
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  2h 30m | Avg: 50m 17s | Max: 58m 17s
      🟩 90a                Pass: 100%/4   | Total:  1h 49m | Avg: 27m 15s | Max: 30m 32s
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 54s | Avg: 4m 57s | Max: 7m 34s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  9m 54s | Avg:  4m 57s | Max:  7m 34s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  9m 54s | Avg:  4m 57s | Max:  7m 34s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 54s | Avg:  4m 57s | Max:  7m 34s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  9m 54s | Avg:  4m 57s | Max:  7m 34s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  9m 54s | Avg:  4m 57s | Max:  7m 34s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  9m 54s | Avg:  4m 57s | Max:  7m 34s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  9m 54s | Avg:  4m 57s | Max:  7m 34s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 20s | Avg:  2m 20s | Max:  2m 20s
      🟩 Test               Pass: 100%/1   | Total:  7m 34s | Avg:  7m 34s | Max:  7m 34s
    
  • 🟩 python: Pass: 100%/1 | Total: 17m 15s | Avg: 17m 15s | Max: 17m 15s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 17m 15s | Avg: 17m 15s | Max: 17m 15s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 17m 15s | Avg: 17m 15s | Max: 17m 15s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 17m 15s | Avg: 17m 15s | Max: 17m 15s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 17m 15s | Avg: 17m 15s | Max: 17m 15s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 17m 15s | Avg: 17m 15s | Max: 17m 15s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 17m 15s | Avg: 17m 15s | Max: 17m 15s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 17m 15s | Avg: 17m 15s | Max: 17m 15s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 17m 15s | Avg: 17m 15s | Max: 17m 15s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
+/- Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 222)

# Runner
184 linux-amd64-cpu16
16 linux-arm64-cpu16
13 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16

@Artem-B
Copy link
Contributor

Artem-B commented Nov 16, 2024

Is that something on your side?

ptxas fatal : Unresolved extern function '_ZN6thrust35THRUST_200800___CUDA_ARCH_LIST___NS2mr19get_global_resourceINS0_26device_ptr_memory_resourceINS0_6system4cuda6detail20cuda_memory_resourceIXadL_ZN3c2h6detail19checked_cuda_mallocEPPvmEEXadL_ZL8cudaFreeSA_EENS0_7pointerIvNS0_8cuda_cub3tagENS0_16tagged_referenceIvSE_EENS0_11use_defaultEEEEEEEEEPT_v'

It appears that allocator.h calls get_global_resource() from a host/device function here:

: base(get_global_resource<Upstream>())

But the function itself is defined as a host only function:

_CCCL_HOST MR* get_global_resource()

That does not work for clang. Clang was supposed to diagnose such use of a host function from a host_device function, but it didn't happen (probably because constructors are special, or we've just missed this corner case).
Making the function host/device gets us past this particular compilation issue, but exposes or triggers other issues:

fatal error: error in backend: PTX does not support "atomic" for orderings different than"NotAtomic" or "Monotonic" for sm_60 or older, but order is: "acquire".
clang: error: clang frontend command failed with exit code 70 (use -v to see invocation)
clang version google3-trunk (f58ce1152703ca753794b8cef36da30bd2668d0f)
Target: x86_64-grtev4-linux-gnu
Thread model: posix
InstalledDir: third_party/crosstool/v18/stable/toolchain/bin

ptxas fatal   : Unresolved extern function '__cxa_guard_acquire'
clang: error: ptxas command failed with exit code 255 (use -v to see invocation)
clang version google3-trunk (f58ce1152703ca753794b8cef36da30bd2668d0f)
Target: x86_64-grtev4-linux-gnu
Thread model: posix
InstalledDir: third_party/crosstool/v18/stable/toolchain/bin

@Artem-B
Copy link
Contributor

Artem-B commented Nov 20, 2024

ptxas fatal : Unresolved extern function '__cxa_guard_acquire'

This is caused by a local static initializer here:

static MR resource;

Looks like it's an odd corner case, which may be something for clang to fix.
Static init is generally not allowed. So, local static variables are supposed to work with an "empty" initializer. That part works fine for both nvcc and clang -- both allow it here.

However, when clang sees an existing, but empty constructor, it allows it, but then that constructor is handled by the usual initializer machinery, and, if the call is not eliminated, ends up generating the guard calls. If we're using the default constructor, everything is fine.

A work-around is to pass -Xarch_device -fno-threadsafe-statics to clang and disable thread-safe guard generation on the GPU.

Passing that option to clang also appears to fix the other error about atomic -- it was apparently a side effect of the thread-safe initializer use.

@Artem-B
Copy link
Contributor

Artem-B commented Nov 22, 2024

@miscco Any thoughts/suggestions on the UB in the thrust_iterator.h ?

third_party/gpus/cccl/v2_6_0/thrust/thrust/iterator/iterator_adaptor.h:197:12: runtime error: reference binding to null pointer of type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior third_party/gpus/cccl/v2_6_0/thrust/thrust/iterator/iterator_adaptor.h:197:12

As for the Unresolved extern function '__cxa_guard_acquire' issue, it's been fixed in upstream clang, but one would need to use -Xarch_device -fno-threadsafe-statics until clang-20 is released.

@miscco
Copy link
Collaborator Author

miscco commented Nov 25, 2024

I believe we should be able to just suppress the warning prior to Clang20 and be done with it.

@miscco
Copy link
Collaborator Author

miscco commented Nov 25, 2024

@miscco Any thoughts/suggestions on the UB in the thrust_iterator.h ?

third_party/gpus/cccl/v2_6_0/thrust/thrust/iterator/iterator_adaptor.h:197:12: runtime error: reference binding to null pointer of type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior third_party/gpus/cccl/v2_6_0/thrust/thrust/iterator/iterator_adaptor.h:197:12

Do you have a call stack for that issue?

@Artem-B
Copy link
Contributor

Artem-B commented Nov 25, 2024

Here's one from catch2_test_warp_scan_api:

#3  0x000055555900f9e1 in __sanitizer::Die () at third_party/llvm/llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_termination.cpp:58
#4  0x0000555559016f4f in __ubsan_handle_type_mismatch_v1_abort () at third_party/llvm/llvm-project/compiler-rt/lib/ubsan/ubsan_handlers.cpp:148
#5  0x000055555916505e in thrust::THRUST_200601_600_700_800_900_NS::iterator_adaptor<thrust::THRUST_200601_600_700_800_900_NS::detail::normal_iterator<int*>, int*, thrust::THRUST_200601_600_700_800_900_NS::use_default, thrust::THRUST_200601_600_700_800_900_NS::use_default, thrust::THRUST_200601_600_700_800_900_NS::use_default, thrust::THRUST_200601_600_700_800_900_NS::use_default, thrust::THRUST_200601_600_700_800_900_NS::use_default>::dereference (this=0x7fffffffbb88)
    at third_party/gpus/cccl/v2_6_0/thrust/thrust/iterator/iterator_adaptor.h:197
#6  0x0000555559164ec2 in thrust::THRUST_200601_600_700_800_900_NS::iterator_core_access::dereference<thrust::THRUST_200601_600_700_800_900_NS::detail::normal_iterator<int*> > (f=...)
    at third_party/gpus/cccl/v2_6_0/thrust/thrust/iterator/iterator_facade.h:194
#7  0x0000555559164b2a in thrust::THRUST_200601_600_700_800_900_NS::iterator_facade<thrust::THRUST_200601_600_700_800_900_NS::detail::normal_iterator<int*>, int, thrust::THRUST_200601_600_700_800_900_NS::system::cpp::detail::tag, thrust::THRUST_200601_600_700_800_900_NS::random_access_traversal_tag, int&, long>::operator* (this=0x7fffffffbb88) at third_party/gpus/cccl/v2_6_0/thrust/thrust/iterator/iterator_facade.h:364
#8  0x000055555917c6ae in thrust::THRUST_200601_600_700_800_900_NS::system::detail::sequential::copy_detail::copy<thrust::THRUST_200601_600_700_800_900_NS::detail::normal_iterator<int*>, int*> (first=..., last=..., result=0x71eb7fe06838)
    at third_party/gpus/cccl/v2_6_0/thrust/thrust/system/detail/sequential/copy.inl:64
#9  0x000055555917c679 in thrust::THRUST_200601_600_700_800_900_NS::system::detail::sequential::copy<thrust::THRUST_200601_600_700_800_900_NS::system::cpp::detail::tag, thrust::THRUST_200601_600_700_800_900_NS::detail::normal_iterator<int*>, int*> (
    first=..., last=..., result=0x71eb7fe06838) at third_party/gpus/cccl/v2_6_0/thrust/thrust/system/detail/sequential/copy.inl:109
#10 0x000055555917c63a in thrust::THRUST_200601_600_700_800_900_NS::copy<thrust::THRUST_200601_600_700_800_900_NS::system::cpp::detail::tag, thrust::THRUST_200601_600_700_800_900_NS::detail::normal_iterator<int*>, int*> (exec=..., first=..., last=...,
    result=0x71eb7fe06838) at third_party/gpus/cccl/v2_6_0/thrust/thrust/detail/copy.inl:44
#11 0x000055555917c564 in thrust::THRUST_200601_600_700_800_900_NS::detail::two_system_copy<thrust::THRUST_200601_600_700_800_900_NS::system::cpp::detail::tag, thrust::THRUST_200601_600_700_800_900_NS::system::cpp::detail::tag, thrust::THRUST_200601_600_700_800_900_NS::detail::normal_iterator<int*>, int*> (system1=..., system2=..., first=..., last=..., result=0x71eb7fe06838) at third_party/gpus/cccl/v2_6_0/thrust/thrust/detail/copy.inl:70
#12 0x000055555917c3cb in thrust::THRUST_200601_600_700_800_900_NS::detail::allocator_traits_detail::copy_construct_range<thrust::THRUST_200601_600_700_800_900_NS::system::cpp::detail::tag, thrust::THRUST_200601_600_700_800_900_NS::mr::stateless_resource_allocator<int, c2h::checked_host_memory_resource>, thrust::THRUST_200601_600_700_800_900_NS::detail::normal_iterator<int*>, int*> (from_system=..., a=..., first=..., last=..., result=0x71eb7fe06838)
    at third_party/gpus/cccl/v2_6_0/thrust/thrust/detail/allocator/copy_construct_range.inl:190
#13 0x000055555917c302 in thrust::THRUST_200601_600_700_800_900_NS::detail::copy_construct_range<thrust::THRUST_200601_600_700_800_900_NS::system::cpp::detail::tag, thrust::THRUST_200601_600_700_800_900_NS::mr::stateless_resource_allocator<int, c2h::checked_host_memory_resource>, thrust::THRUST_200601_600_700_800_900_NS::detail::normal_iterator<int*>, int*> (from_system=..., a=..., first=..., last=..., result=0x71eb7fe06838)
    at third_party/gpus/cccl/v2_6_0/thrust/thrust/detail/allocator/copy_construct_range.inl:231
#14 0x000055555917ba51 in thrust::THRUST_200601_600_700_800_900_NS::detail::contiguous_storage<int, thrust::THRUST_200601_600_700_800_900_NS::mr::stateless_resource_allocator<int, c2h::checked_host_memory_resource> >::uninitialized_copy<thrust::THRUST_200601_600_700_800_900_NS::detail::normal_iterator<int*> > (this=0x7fffffffc438, first=..., last=..., result=...) at third_party/gpus/cccl/v2_6_0/thrust/thrust/detail/contiguous_storage.inl:234
#15 0x000055555917b4c1 in thrust::THRUST_200601_600_700_800_900_NS::detail::vector_base<int, thrust::THRUST_200601_600_700_800_900_NS::mr::stateless_resource_allocator<int, c2h::checked_host_memory_resource> >::fill_insert (this=0x7fffffffc438,
    position=..., n=1, x=@0x7fffffffc080: 32) at third_party/gpus/cccl/v2_6_0/thrust/thrust/detail/vector_base.inl:922
#16 0x000055555917adb8 in thrust::THRUST_200601_600_700_800_900_NS::detail::vector_base<int, thrust::THRUST_200601_600_700_800_900_NS::mr::stateless_resource_allocator<int, c2h::checked_host_memory_resource> >::insert_dispatch<int> (
    this=0x7fffffffc438, position=..., n=1, x=32) at third_party/gpus/cccl/v2_6_0/thrust/thrust/detail/vector_base.inl:673
#17 0x000055555917acc6 in thrust::THRUST_200601_600_700_800_900_NS::detail::vector_base<int, thrust::THRUST_200601_600_700_800_900_NS::mr::stateless_resource_allocator<int, c2h::checked_host_memory_resource> >::insert<int> (this=0x7fffffffc438,
    position=..., first=1, last=32) at third_party/gpus/cccl/v2_6_0/thrust/thrust/detail/vector_base.inl:645
#18 0x000055555917abdf in thrust::THRUST_200601_600_700_800_900_NS::detail::vector_base<int, thrust::THRUST_200601_600_700_800_900_NS::mr::stateless_resource_allocator<int, c2h::checked_host_memory_resource> >::insert (this=0x7fffffffc438,
    position=..., x=@0x7fffffffc3cc: 32) at third_party/gpus/cccl/v2_6_0/thrust/thrust/detail/vector_base.inl:623
#19 0x0000555559178a9d in thrust::THRUST_200601_600_700_800_900_NS::detail::vector_base<int, thrust::THRUST_200601_600_700_800_900_NS::mr::stateless_resource_allocator<int, c2h::checked_host_memory_resource> >::push_back (this=0x7fffffffc438,
    x=@0x7fffffffc3cc: 32) at third_party/gpus/cccl/v2_6_0/thrust/thrust/detail/vector_base.inl:547
#20 0x0000555559088e00 in C_A_T_C_H_T_E_M_P_L_A_T_E_T_E_S_T_F_U_N_C_3<metal::list<> >() () at third_party/gpus/cccl/v2_6_0/cub/test/catch2_test_warp_scan_api.cu.cc:165

Afaict, we end up dereferencing m_iterator field which is NULL, and which was created (AFAICT) in frame #15 here:

new_end = m_storage.uninitialized_copy(begin(), position, new_storage.begin());

@miscco
Copy link
Collaborator Author

miscco commented Nov 26, 2024

That looks like a bug to me, because we never check whether there an allocation there

@Artem-B
Copy link
Contributor

Artem-B commented Nov 26, 2024

That looks like a bug to me,

Tracked by #2964

I believe we should be able to just suppress the warning prior to Clang20 and be done with it.

Can you elaborate on that? Unresolved reference to __cxa_guard_acquire is a hard error, so is the unresolved reference to a host-only get_global_resource(). While the former can be worked around by compiler flags, the latter still needs a source level fix.

@miscco
Copy link
Collaborator Author

miscco commented Nov 30, 2024

Can you elaborate on that? Unresolved reference to __cxa_guard_acquire is a hard error, so is the unresolved reference to a host-only get_global_resource(). While the former can be worked around by compiler flags, the latter still needs a source level fix.

Oh yeah I mean fixing get_resource but use -Xarch_device -fno-threadsafe-statics until clang-20

@Artem-B
Copy link
Contributor

Artem-B commented Dec 3, 2024

ptxas fatal : Unresolved extern function... get_global_resource()

There is indeed a bug in clang which resulted in the call to a wrong-side function not to be diagnosed: llvm/llvm-project#118415

Artem-B added a commit to Artem-B/cccl that referenced this pull request Dec 3, 2024
The function is called from a host/device constructors and may be invoked from the device code.

As it happens neither NVCC nor clang diagnose such an invalid call.
llvm/llvm-project#118415

As the result, in unoptimized builds we end up with ptxas failing with
an unresolved reference to this function because it is never generated
during GPU-side compilation.
NVIDIA#2813 (comment)
Artem-B added a commit to Artem-B/cccl that referenced this pull request Dec 3, 2024
The function is called from a host/device constructors and may be invoked from the device code.

As it happens neither NVCC nor clang diagnose such an invalid call.
llvm/llvm-project#118415

As the result, in unoptimized builds we end up with ptxas failing with
an unresolved reference to this function because it is never generated
during GPU-side compilation.
NVIDIA#2813 (comment)
Artem-B added a commit to Artem-B/cccl that referenced this pull request Dec 4, 2024
The function is called from a host/device constructors and may be invoked from the device code.

As it happens neither NVCC nor clang diagnose such an invalid call.
llvm/llvm-project#118415

As the result, in unoptimized builds we end up with ptxas failing with
an unresolved reference to this function because it is never generated
during GPU-side compilation.
NVIDIA#2813 (comment)
Artem-B added a commit to Artem-B/cccl that referenced this pull request Dec 4, 2024
The function is called from a host/device constructors and may be invoked from the device code.

As it happens neither NVCC nor clang diagnose such an invalid call.
llvm/llvm-project#118415

As the result, in unoptimized builds we end up with ptxas failing with
an unresolved reference to this function because it is never generated
during GPU-side compilation.
NVIDIA#2813 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

4 participants