# Version 0.8.0: nvFatbin, CUDA 12.x features, sync-async op unification, static targets, etc.

Changes since v0.7.1:
## Support for the nvFatbin library (#681)

- The API wrappers now support nvFatbin, NVIDIA's library for creating/marshalling its "fat binary" file format. Support is provided via a `cuda::fatbin_builder_t` class: one creates a builder, adds various fragments of fatbin-contained content (cubin, PTX, LTO IR, etc.), then finally uses the `build()` or `build_at()` method to obtain the completed, final fatbin file data in a region of memory (see the sketch below this list).
- The project's CMake now exports a new target, `cuda-api-wrappers::fatbin`, which one should depend on when actually using the builder.
- NVIDIA has not fully documented this library, so some functionality is not fully articulated, and some is only partially supported (specifically, passing extra options when adding LTO IR or PTX).
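To make the described workflow concrete, here is a minimal, hypothetical sketch. Only `cuda::fatbin_builder_t`, `build()` and `build_at()` are named in these notes; the builder-creation call and the fragment-adding method used below are illustrative assumptions, not the wrappers' confirmed API:

```cpp
#include <cuda/api.hpp>
#include <cstddef>

// Hypothetical sketch of the fatbin-building workflow described above.
// The builder-creation call and add_ptx() are assumptions made purely for
// illustration; consult the fatbin_builder_t header for the actual API.
void build_example_fatbin(const char* ptx_source, std::size_t ptx_size)
{
    auto builder = cuda::fatbin_builder::create();                 // assumption
    builder.add_ptx(ptx_source, ptx_size, /* arch = */ "sm_75");   // assumption
    auto fatbin_data = builder.build();  // build() yields the completed fatbin
                                         // contents in a region of memory
    // ... pass fatbin_data on, e.g. to module or library loading ...
    (void) fatbin_data;
}
```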
## Support for more CUDA 12.x features
- #669 : Can now obtain the kernels available in a given `cuda::module_t`, with the method `unique_span<kernel_t> get_kernels() const` (illustrated below)
- #670 : Can now obtain a kernel's name, and the module containing it, via the kernel's handle. However, only the mangled kernel name is accessible - hence the method's name: `kernel_t::mangled_name()` (regards #674)
- #675 : Can now query CUDA's module loading mode (lazy or eager)

(Note: these features are not accessible if you're using the wrappers with CUDA 11.x.)
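For instance, enumerating the kernels in an already-loaded module and printing their (mangled) names could look roughly like the following sketch; only `get_kernels()` and `mangled_name()` are taken from these notes, and obtaining the `cuda::module_t` itself is left out:

```cpp
#include <cuda/api.hpp>
#include <iostream>

// Sketch: list the kernels within an already-loaded module. Loading the
// module itself (e.g. from compiled output) is not shown here.
void list_module_kernels(const cuda::module_t& mod)
{
    auto kernels = mod.get_kernels();     // a unique_span<kernel_t>
    for (const auto& kernel : kernels) {
        // Only the mangled name is accessible via the kernel's handle
        std::cout << kernel.mangled_name() << '\n';
    }
}
```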
## More `unique_span` class changes

Like a recently-cut gem, which one slowly polishes until it gains its proper shine... we did some work on `unique_span` in version 0.7.1 as well, and it continues in this version:
- #678 : The deleter is now instance-specific, so it is possible to allocate in more than one way - even depending on the span size - and also to decouple the use of such unique-spans from the allocation decisions. Also, the deleter takes a span, not just a pointer, so it can make decisions based on the allocation size (see the illustration after this list).
- #665 :
  - Simplified the `swap()` implementation
  - Removed some redundant code
  - Shortened some code
  - Can now properly convert from a span of `T` to a span of `const T`
  - Neither `release()` nor our move construction can be `noexcept` - removed that marking, which had been based only on optimism
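The following is a generic illustration of the instance-specific, span-taking deleter design, not the wrappers' actual `unique_span` code: each object carries its own deleter, and the deleter receives the whole extent of the allocation rather than just its starting address.

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Generic illustration only (not cuda::unique_span itself): the deleter is
// stored per instance, so differently-allocated buffers can share one owner
// type, and it receives the full extent, so it can act on the size as well.
template <typename T>
class owning_buffer {
public:
    struct extent { T* data; std::size_t size; };
    using deleter_type = std::function<void(extent)>;

    owning_buffer(extent allocation, deleter_type deleter)
        : allocation_(allocation), deleter_(std::move(deleter)) {}

    ~owning_buffer()
    {
        if (allocation_.data != nullptr) { deleter_(allocation_); }
    }

    owning_buffer(const owning_buffer&) = delete;
    owning_buffer& operator=(const owning_buffer&) = delete;
    // (move operations omitted for brevity)

    T* data() const { return allocation_.data; }
    std::size_t size() const { return allocation_.size; }

private:
    extent allocation_;     // what the deleter will receive on destruction
    deleter_type deleter_;  // instance-specific: may differ per allocation strategy
};
```

With this shape, a small buffer might come from a pool and a large one from a dedicated allocation, with the appropriate release routine attached to each instance, while code that merely uses the buffers stays oblivious to those decisions.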
## `optional_ref` & partial unification of async and non-async memory operations
- #691 : Added an `optional_ref` class, for passing optional arguments which are references. (See the blog post by Foonathan on the problems of putting references into C++ `optional`s.)
- #689 : Memory-related operations which had both a `cuda::memory::foo()` and a `cuda::memory::async::foo()` variant now have a single variant, `cuda::memory::foo()`, which takes an extra `optional_ref<stream_t>` parameter: when it is not set, the operation is synchronous; when it is set, the operation is asynchronous and scheduled on that stream. (But note the "fine print" regarding synchronous and asynchronous CUDA operations in the Runtime and Driver API reference documents.) See the sketch after this list.
- #688 : Can now asynchronously copy 2D and 3D data using "copy parameters" structures
- #687 : The synchronous and asynchronous variants of `copy_single()` had disagreed - one took a pointer, the other a reference. With their unification, they now agree (and take a pointer).
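Here is a rough sketch of the unified call style, using `cuda::memory::copy` as the concrete operation; the exact overload shown (destination, source, byte count, optional trailing stream) is an assumption based on these notes rather than a guaranteed signature:

```cpp
#include <cuda/api.hpp>
#include <cstddef>

// Sketch of the unified synchronous/asynchronous call style. The copy()
// overload shown here is assumed for illustration; see the memory-operation
// headers for the exact signatures.
void copy_both_ways(void* device_buffer, const void* host_buffer,
                    std::size_t num_bytes, cuda::stream_t& stream)
{
    // No stream argument: the operation is synchronous
    cuda::memory::copy(device_buffer, host_buffer, num_bytes);

    // Stream argument provided: the operation is scheduled on that stream,
    // asynchronously (subject to the "fine print" mentioned above)
    cuda::memory::copy(device_buffer, host_buffer, num_bytes, stream);

    stream.synchronize();
}
```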
## Bug fixes

### Poor man's optional class
- #664, #666 : Tweaked the class to avoid build failure in MSVC
- #676 : `value_or()` now returns a value...
- #682 : `value_or()` is now `const`
### In example programs
- #672 : The simpleCudaGraphs example program was crashing due to a gratuitous launch-configuration setting
- #673 : Potential use-before-initialization in the simpleIPC example
## Other changes

### Build mechanism
- #699 : Now exposing targets with a `_static` suffix, which in turn depend on the static versions of the CUDA libraries, when those are available. For example, we now have both `cuda-api-wrappers::rtc` and `cuda-api-wrappers::rtc_static`.
- #694 : Now properly building fatbin files on systems with multiple GPUs of different compute capabilities
### In the wrapper APIs themselves
- #667 : Some dimension-class methods were missing `noexcept` designators
- #671 : A bit of macro renaming, to avoid clashes with other libraries
- #684 : Now taking more linear sizes as `size_t`s in `launch_config_builder_t`'s methods, so as to prevent narrowing-cast warnings; we check the limits ourselves (see the sketch after this list)
- #686 : When loading a kernel from a library, one can now specify the context in which to obtain the kernel
- #698 : Added a shortcut function for getting the default device: `cuda::device::default()`
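As an illustration of the #684 change, a linear problem size held in a `size_t` can now be fed to the launch-config builder directly; the fluent method names below follow the library's existing example programs, but treat them as assumptions if your version differs:

```cpp
#include <cuda/api.hpp>
#include <cstddef>

// Sketch: building a launch configuration from a 1D problem size held in a
// size_t, without narrowing-cast warnings. Method names follow the library's
// builder-style usage as seen in its examples; verify against your version.
cuda::launch_configuration_t make_1d_config(std::size_t num_elements)
{
    constexpr int threads_per_block = 256;
    return cuda::launch_config_builder()
        .overall_size(num_elements)     // a size_t is now accepted directly (#684)
        .block_size(threads_per_block)
        .build();
}
```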