-
Notifications
You must be signed in to change notification settings - Fork 6
Description
to prevent something like #337 from happening again, i suggest we do a few things to improve error propagation in resolve.
first, we should annotate all functions which return error values with [[nodiscard]]. this will require that we bump the c++ version to 17, but this will greatly improve the ease at which resolve allows its users to write fast, correct code. this is only a stopgap solution, of course, as the current way in which we handle errors leaves a lot to be desired.
right now, the way we handle errors is by returning a simple integer value of indeterminate value (see the transpose implementation using hip, the various stub methods which return 1, etc). there are several issues with this approach. for one, there is no canonical "error code" description, our most useful output channel for the semantics around errors consists of the logging channels. two, these methods don't even necessarily return an "error code". again, see the transpose implementation using the hip backend, which returns a sum of the error codes returned by the rocsparse functions which is semantically meaningless due to these values on their own not just representing the existence of an error but also the kind of error itself.
the issue with all of this is that it doesn't allow the people calling resolve code to consistently handle the error beyond guessing what it is. if we had logged the return value of the transpose function in #337, we would not have known much of anything from that alone, as the function itself does not bail if the buffer sizing function call fails and will sum the return values of the buffer sizing function and the csr to csc function call.
to fix this, i suggest we create our own error structure that abstracts over errors stemming from our own code and those in libraries we depend upon, such as rocm. this could be an enum, but because we need to wrap over cuda, hip, and other libraries, it's probably best to provide a way to extract out the underlying error from this, so an enum wouldn't work. this would allow calling code to properly handle the causes of errors where this is possible. we should return this structure from our own functions and annotate the structure itself with [[nodiscard]] so that just ignoring it results in a warning.