
Version 0.8.0: nvFatbin, CUDA 12.x features, sync-async op unification, static targets etc.

Released by @eyalroz on 18 Nov, 15:39

Changes since v0.7.1:

Support for the nvFatbin library (#681)

  • The API wrappers now support NVIDIA's "fat binary" file format creation/marshalling library, nvFatbin, via a cuda::fatbin_builder_t class: one creates a builder, adds various fragments of fatbin-contained content (cubin, PTX, LTO IR etc.), and finally uses the build() or build_at() method to obtain the completed fatbin file data in a region of memory.
  • The project's CMake now exports a new target, cuda-api-wrappers::fatbin, which one should depend on when actually using the builder.
  • NVIDIA has not fully documented this library, so some functionality is not fully articulated, and some is only partially supported (specifically, passing extra options when adding LTO IR or PTX).
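
The intended flow is: create a builder, add fragments, then build. A minimal sketch of that flow follows; only fatbin_builder_t, build() and build_at() come from the changes above, while the header name, the factory call and the fragment-adding method are hypothetical assumptions, not the library's verified API:

```cpp
// Hedged sketch - not verified against the library's actual headers.
#include <cuda/api.hpp>  // assumed umbrella header

void build_a_fatbin(const char* ptx_source, size_t ptx_size)
{
    auto builder = cuda::fatbin_builder::create(); // hypothetical factory call

    // Add fragments of fatbin-contained content (cubin, PTX, LTO IR, ...);
    // the method name and signature below are illustrative only:
    // builder.add_ptx(ptx_source, ptx_size, /* target: */ "sm_80");

    // Obtain the completed fatbin file data in a region of memory:
    auto fatbin = builder.build();
    // ... or have it written into a caller-provided region:
    // builder.build_at(my_region);
}
```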

Support for more CUDA 12.x features

  • #669 : Can now obtain the kernels available in a given cuda::module_t, with the method unique_span<kernel_t> get_kernels() const.
  • #670 : Can now obtain a kernel's name, and the module containing it, via the kernel's handle. Only the mangled kernel name is accessible, hence the method's name: kernel_t::mangled_name() (see also #674)
  • #675 : Can now query CUDA's module loading mode (lazy or eager)

(Note these features are not accessible if you're using the wrappers with CUDA 11.x)
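
For instance, enumerating a module's kernels and printing their (mangled) names could look roughly like this; get_kernels() and mangled_name() are taken from the items above, while the header name and surrounding code are assumptions:

```cpp
// Sketch only (requires CUDA 12.x); header name and iteration details assumed.
#include <cuda/api.hpp>
#include <iostream>

void list_kernels(const cuda::module_t& module)
{
    // unique_span<kernel_t> get_kernels() const - per the release notes
    auto kernels = module.get_kernels();
    for (const auto& kernel : kernels) {
        // Only the mangled name is accessible via the kernel's handle:
        std::cout << kernel.mangled_name() << '\n';
    }
}
```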

More unique_span class changes

Like a recently-cut gem, which one slowly polishes until it gains its proper shine... we had some work on unique_span in version 0.7.1 as well, and it continues in this version:

  • #678 : The deleter is now instance-specific, so it is possible to choose the allocation method per-instance - even depending on the span size - and to decouple the use of such unique_spans from allocation decisions. Also, the deleter now takes a span rather than just a pointer, so it can make decisions based on the allocation size.
  • #665 :
    • Simplified the swap() implementation
    • Removed some redundant code
    • Shortened some code
    • Can now properly convert from a span of T to a span of const T.
    • Neither release() nor our move construction can actually be noexcept - removed that marking, which had been based only on optimism

optional_ref & partial unification of async and non-async memory operations

  • #691 : Added an optional_ref class, for passing optional arguments which are references. See Foonathan's blog post on the problems of putting references into C++ optionals.
  • #689 : Memory-related operations which had cuda::memory::foo() and cuda::memory::async::foo() variants now have a single variant, cuda::memory::foo(), taking an extra optional_ref<stream_t> parameter: when it is not set, the operation is synchronous; when it is set, the operation is asynchronous and scheduled on that stream. (But note the "fine print" regarding synchronous and asynchronous CUDA operations in the Runtime and Driver API reference documents.)
  • #688 : Can now asynchronously copy 2D and 3D data using "copy parameters" structures
  • #687 : The synchronous and asynchronous variants of copy_single() had disagreed - one took a pointer, the other a reference. With their unification, they now agree (both take a pointer).
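
Under this unification, one call serves both modes. A hedged sketch - optional_ref<stream_t> and cuda::memory::copy are from the items above, but the exact overload and parameter order are assumptions:

```cpp
// Sketch - the parameter order of cuda::memory::copy() is assumed, not verified.
#include <cuda/api.hpp>  // assumed umbrella header

void copy_twice(void* dst, const void* src, size_t num_bytes, cuda::stream_t& stream)
{
    cuda::memory::copy(dst, src, num_bytes);          // no stream: synchronous
    cuda::memory::copy(dst, src, num_bytes, stream);  // stream set: asynchronous,
                                                      // scheduled on that stream
}
```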

Bug fixes

Poor man's optional class

  • #664, #666 : Tweaked the class to avoid build failure in MSVC
  • #676 : value_or() now returns a value...
  • #682 : value_or() is now const

In example programs

  • #672 : The simpleCudaGraphs example program was crashing due to gratuitously setting launch configuration options
  • #673 : Potential use-before-initialization in the simpleIPC example

Other changes

Build mechanism

  • #699 : Now exposing targets with a _static suffix, which in turn depend on the static versions of the CUDA libraries, when those are available. For example, we now have both cuda-api-wrappers::rtc and cuda-api-wrappers::rtc_static
  • #694 : Now properly building fatbin files on systems with multiple GPUs of different compute capabilities
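
In CMake terms, choosing the static variant is just a matter of which exported target you link. A sketch - the target names are from #699 above, but the package name in find_package() is an assumption:

```cmake
find_package(cuda-api-wrappers REQUIRED)  # package name assumed

# Dynamic CUDA libraries:
# target_link_libraries(my_app PRIVATE cuda-api-wrappers::rtc)

# Static CUDA libraries, when available:
target_link_libraries(my_app PRIVATE cuda-api-wrappers::rtc_static)
```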

In the wrapper APIs themselves

  • #667 : Added missing noexcept designators to some dimension-class methods
  • #671 : A bit of macro renaming to avoid clashing with other libraries
  • #684 : Taking more linear sizes as size_t's in launch_config_builder_t's methods, so as to prevent narrowing-cast warnings; we check the limits ourselves.
  • #686 : When loading a kernel from a library, one can now specify the context in which to obtain the kernel.
  • #698 : Add a shortcut function for getting the default device: cuda::device::default().

In example programs

  • #680 : Added an example program invoking cuBLAS for matrix multiplication (cuBLAS itself is not wrapped by this library)
  • #683 : The random number generation is now the same in all vectorAdd example programs