This release provides major new functionality in the core BLIS framework, along with many other bugfixes and small changes.
Improvements present in 2.0 (June 25, 2025):
Known Issues:
- There is a performance regression in the `ztrmm` and `ztrsm` operations. On the Ampere Altra, performance is impacted by up to 30%; it is currently unknown if and how much this bug affects other architectures, but the effect should be much smaller in most cases.
Framework:
- BLIS now supports "plugins", which provide additional functionality through user-defined kernels, blocksizes, and kernel preferences. Users can use an installed copy of BLIS (even a binary-only distribution) to create a plugin outside of the BLIS source tree. User-written reference kernels can then be registered into BLIS, and are compiled by the BLIS build system for all configured architectures. This also means that user-provided kernels participate in run-time kernel selection based on the actual hardware used! Additionally, users can provide and register optimized kernels for specific architectures, which are automatically selected as appropriate. See `docs/PluginHowTo.md` for more information.
- A new API has been added which allows users to modify the default "control tree". This data structure defines the specific algorithmic steps used to implement a level-3 BLAS operation such as `gemm` or `syrk`. Users can start with a predefined control tree for one of the level-3 BLAS operations (except `trsm` currently) and then modify it to produce a custom operation. Users can change kernels for packing and computation, associated blocksizes, and provide additional information (such as external parameters or additional data) which is passed directly to the kernels. See `docs/PluginHowTo.md` for more information and a working example.
- All level-3 BLAS operations (except `trsm`) now support full mixed-precision mixed-domain computation. The A, B, and C matrices, as well as the alpha and beta scalars, may be provided in any of the supported data types (single/double precision and real/complex domain, currently), and an additionally-provided computational precision controls how the computation is actually performed internally. The computational precision can be set on the `obj_t` structure representing the C matrix.
- Added a `func2_t` struct for dealing with 2-type kernels (see below). A `func2_t` can be safely cast to `func_t` to refer to only kernels with equal type parameters. (Devin Matthews)
- The `bli_*_front` functions have been removed.
- Extensive other back-end changes and improvements.
- A new "level-0" macro back-end has been implemented. These macros form the basic language for implementing reference kernels and for enabling correct mixed-type computation. The new back-end specifically supports full data-type flexibility, including the "computational" data-type (e.g. input/output in double, compute in single), as well as fully correct mixed-domain computation and safe in-place usage of operations such as `scal2v`. A dedicated testsuite (C++17 required) has also been added for this layer. A number of legacy macros have been retained as wrappers so that current code (e.g. optimized kernels) is not affected.
- Fixed a lurking bug in `bli_obj_imag_part` which would have caused the base address to be computed incorrectly for sub-matrix objects.
- Users can now force the use of a particular configuration at runtime using `BLIS_ARCH_TYPE=<name>`, where `<name>` is one of the configured sub-configurations (check the output of `configure` for options). This functionality existed previously, but only using numeric configuration IDs, which are undocumented.
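The idea of a "computational precision" that differs from the storage precision (e.g. input/output in double, compute in single) can be illustrated without any BLIS calls. The following is only a minimal sketch: `gemm_d_storage_s_compute` is a hypothetical helper written for this note, not a BLIS function, and it makes no claim about how BLIS implements mixed-precision computation internally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration (NOT a BLIS function): C := C + A*B, where
   A, B, and C are stored in double precision (row-major) but the inner
   products are computed and accumulated in single precision. */
void gemm_d_storage_s_compute(size_t m, size_t n, size_t k,
                              const double *a, /* m x k */
                              const double *b, /* k x n */
                              double *c)       /* m x n */
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f; /* computational precision: single */

            for (size_t p = 0; p < k; ++p) {
                /* Operands are cast down from the storage precision. */
                acc += (float)a[i * k + p] * (float)b[p * n + j];
            }

            /* The result is cast back up to the storage precision of C. */
            c[i * n + j] += (double)acc;
        }
    }
}
```

Any bits of the inputs that do not fit in single precision are dropped before the multiply, which is the trade-off a user accepts when selecting a lower computational precision.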
Compatibility:
- Added a ScaLAPACK compatibility mode which disables some conflicting BLAS definitions. (Field Van Zee)
- Fixed issues with improperly escaped strings in Python scripts for compatibility with Python 3.12+. (@AngryLoki)
- Added a user-defined macro `BLIS_ENABLE_STD_COMPLEX` which uses `std::complex` typedefs in `blis.h` for C++ code. (Devin Matthews)
- Fixed a bug in the definition of some scalar level-0 macros affecting compatibility of `bli_creal` and `bli_zreal`, for example. (Devin Matthews)
- Fixed improperly-quoted strings in Python scripts which affected compatibility with Python 3.12+. (@AngryLoki)
- The static initializer macros (`BLIS_*_INITIALIZER`) have been fixed for compatibility with C++. (Devin Matthews)
- Install "helper" `blis.h` and `cblas.h` headers directly to `INCDIR` (in addition to the full files in `INCDIR/blis`). (Field Van Zee, Jed Brown, Mo Zhou)
- `gemmtr` aliases for the `gemmt` BLAS and CBLAS compatibility functions have been added to support recent versions of LAPACK. (Mo Zhou)
Kernels:
- Fixed an out-of-bounds read bug in the `haswell` `gemmsup` kernels. (John Mather)
- Fixed a bug in the complex-domain `gemm` kernels for `piledriver`. (@rmast)
- Kernel, blocksizes, and preference lookup functions now use `siz_t` rather than specific enums. (Devin Matthews)
- Fixed some issues with run-time kernel detection and added more ARM part numbers/manufacturer codes. (John Mather)
- Kernels can now be added which have two datatype parameters. Kernel IDs are assigned such that 1-type and 2-type kernels cannot be interchanged accidentally. (Devin Matthews)
- The packing microkernels and computational microkernels (`gemm` and `gemmtrsm`) now receive offsets into the global matrix. The latter are passed via the `auxinfo_t` struct. (Devin Matthews)
- The separate "MRxk" and "NRxk" packing kernels have been merged into one generic packing kernel. Packing kernels are now expected to pack any size micropanel, but may optimize for specific shapes. (Devin Matthews)
- Added explicit packing kernels for diagonal portions of matrices, and for certain mixed-domain/1m cases. (Devin Matthews)
- Improved support for duplication during packing ("broadcast-B") across all packing kernels.
- Some bugs with mixed-precision/mixed-domain operations on certain architectures (esp. AVX512) have been fixed.
- Fixed a bug affecting reference kernels with clang 14.
- Fixed a problem affecting row/column strides of exactly -1 with `gemm1m`.
- Fixed an incompatibility between the `haswell` `gemmsup` kernels and gcc 15. (Dave Love, Christopher Hillenbrand)
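To make the "one generic packing kernel" item above more concrete, here is a rough sketch of a packing routine that accepts a micropanel of any size and zero-pads it, while taking a faster path for one specific shape. The function name `pack_micropanel`, its parameter list, and the packed layout are assumptions made for illustration; they are not the BLIS packing kernel interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch (NOT the BLIS kernel signature): pack an m x k block
   of A, addressed with row stride rs and column stride cs, into a buffer p
   holding k columns of mr_max contiguous entries each. Rows m..mr_max-1 of
   every packed column are zero-filled. */
void pack_micropanel(size_t m, size_t k, size_t mr_max,
                     const double *a, ptrdiff_t rs, ptrdiff_t cs,
                     double *p)
{
    assert(m <= mr_max);

    for (size_t j = 0; j < k; ++j) {
        const double *a_col = a + (ptrdiff_t)j * cs;
        double *p_col = p + j * mr_max;

        /* Optimized shape: a full-height, unit-stride column is copied
           in one call. */
        if (m == mr_max && rs == 1) {
            memcpy(p_col, a_col, mr_max * sizeof *p_col);
            continue;
        }

        /* Generic path: any m and any row stride. */
        for (size_t i = 0; i < m; ++i)
            p_col[i] = a_col[(ptrdiff_t)i * rs];

        /* Zero-pad the edge so a fixed-size microkernel can consume it. */
        for (size_t i = m; i < mr_max; ++i)
            p_col[i] = 0.0;
    }
}
```

The generic path keeps the routine correct for every shape, so the shape-specific branch is purely a performance choice.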
Build system:
- The `cblas.h` file is now "flattened" immediately after `blis.h` is (if enabled), rather than later in the build process. (Jeff Diamond, Field Van Zee)
- Added script to help with preparing release candidate branches. (Field Van Zee)
- The configure script has been overhauled. In particular, using spaces in `CC`/`CXX` is now supported. (Devin Matthews)
- Improved support for C++ source files in BLIS or in plugins. (Devin Matthews)
- Disabled `armsve` on Windows due to build failures. (Hernan Martinez, Atsushi Tatsuma)
- Added integer `BLIS_VERSION_{MAJOR,MINOR,REVISION}` macros to `blis.h` so that users can check BLIS version compatibility through the C preprocessor.
- Moved `#include <omp.h>` from `blis.h` to the relevant source files. (Melven Roehrig-Zoellner)
- Disabled building KNL with gcc 15. (Dave Love)
- Improved support for the Intel and NVIDIA Fortran compilers (ifx and nvfortran), particularly in terms of selecting the correct method for returning complex numbers. (Jeff Hammond)
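As an illustration of the new integer version macros, the sketch below gates a feature on the BLIS version at compile time. The `APP_*` names and the `app_have_blis_plugins` function are hypothetical and exist only for this example. The fallback definitions are stand-ins so that the snippet compiles on its own; real code would include `blis.h` and use the values defined there.

```c
#include <assert.h>

/* Stand-ins so this sketch compiles without BLIS installed. In real code,
   include "blis.h" instead and drop this block. */
#ifndef BLIS_VERSION_MAJOR
#define BLIS_VERSION_MAJOR    2
#define BLIS_VERSION_MINOR    0
#define BLIS_VERSION_REVISION 0
#endif

/* Hypothetical helper macro (not provided by BLIS): true if the BLIS
   version is at least maj.min.rev. Usable in #if and in C expressions. */
#define APP_BLIS_AT_LEAST(maj, min, rev)                      \
    (BLIS_VERSION_MAJOR > (maj) ||                            \
     (BLIS_VERSION_MAJOR == (maj) &&                          \
      (BLIS_VERSION_MINOR > (min) ||                          \
       (BLIS_VERSION_MINOR == (min) &&                        \
        BLIS_VERSION_REVISION >= (rev)))))

/* Plugins are new in 2.0, so only enable plugin code paths from there on. */
#if APP_BLIS_AT_LEAST(2, 0, 0)
#define APP_HAVE_BLIS_PLUGINS 1
#else
#define APP_HAVE_BLIS_PLUGINS 0
#endif

/* Runtime-visible view of the compile-time decision. */
int app_have_blis_plugins(void)
{
    return APP_HAVE_BLIS_PLUGINS;
}
```

Because the macros are plain integers, the comparison works inside `#if`, which lets an application compile against both older and newer BLIS installations from one source tree.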
Testing:
- test/3 drivers now allow using the "default" induced method, rather than forcing native or 1m operation. (Field Van Zee, Leick Robinson)
- Fixed some segfaults in the test/3 drivers. (Field Van Zee, Leick Robinson)
- The testsuite now tests all possible type combinations when requested. (Devin Matthews)
- Improved detection of problems in `make check-blis` and related targets. (Devin Matthews)
- CI testing infrastructure has moved to CircleCI.
Documentation:
- Added documentation for the new plugin system and for creating custom operations by modifying the BLIS control tree. (Devin Matthews)
- Updated documentation for downloading BLIS in `README.md` and instructions for maintainers in `RELEASING`. (Field Van Zee)
- Widened print format in code examples to avoid misinterpretation of results. (Minh Quan Ho, Mason McBride)