
BLIS 2.0

@devinamatthews devinamatthews released this 15 Jan 22:14

This release provides major new functionality in the core BLIS framework, along with many other bugfixes and small changes.

Improvements present in 2.0 (June 25, 2025):

Known Issues:

  • There is a performance regression in the ztrmm and ztrsm operations. On the Ampere Altra, performance is reduced by up to 30%. It is not yet known whether, or by how much, this bug affects other architectures, but the effect is expected to be much smaller in most cases.

Framework:

  • BLIS now supports "plugins", which provide additional functionality through user-defined kernels, blocksizes, and kernel preferences. Users can use an installed copy of BLIS (even a binary-only distribution) to create a plugin outside of the BLIS source tree. User-written reference kernels can then be registered into BLIS, and are compiled by the BLIS build system for all configured architectures. This also means that user-provided kernels participate in run-time kernel selection based on the actual hardware used! Additionally, users can provide and register optimized kernels for specific architectures which are automatically selected as appropriate. See docs/PluginHowTo.md for more information.
  • A new API has been added which allows users to modify the default "control tree". This data structure defines the specific algorithmic steps used to implement a level-3 BLAS operation such as gemm or syrk. Users can start with a predefined control tree for one of the level-3 BLAS operations (except trsm currently) and then modify it to produce a custom operation. Users can change kernels for packing and computation, associated blocksizes, and provide additional information (such as external parameters or additional data) which is passed directly to the kernels. See docs/PluginHowTo.md for more information and a working example.
  • All level-3 BLAS operations (except trsm) now support full mixed-precision mixed-domain computation. The A, B, and C matrices, as well as the alpha and beta scalars, may be provided in any of the supported data types (single/double precision and real/complex domain, currently), and an additionally-provided computational precision controls how the computation is actually performed internally. The computational precision can be set on the obj_t structure representing the C matrix.
  • Added a func2_t struct for dealing with 2-type kernels (see below). A func2_t can be safely cast to func_t to refer to only kernels with equal type parameters. (Devin Matthews)
  • The bli_*_front functions have been removed.
  • Extensive other back-end changes and improvements.
  • A new "level-0" macro back-end has been implemented. These macros form the basic language for implementing reference kernels and for enabling correct mixed-type computation. The new back-end specifically supports full data-type flexibility, including the "computational" data-type (e.g. input/output in double, compute in single), as well as fully correct mixed-domain computation and safe in-place usage of operations such as scal2v. A dedicated testsuite (C++17 required) has also been added for this layer. A number of legacy macros have been retained as wrappers so that current code (e.g. optimized kernels) is not affected.
  • Fixed a lurking bug in bli_obj_imag_part which would have caused the base address to be computed incorrectly for sub-matrix objects.
  • Users can now force the use of a particular configuration at runtime using BLIS_ARCH_TYPE=<name>, where <name> is one of the
    configured sub-configurations (check the output of configure for options). This functionality existed previously, but only
    through numeric configuration IDs, which are undocumented.
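
To make the "computational precision" idea from the mixed-precision and level-0 items above concrete, here is a minimal plain-C sketch of "input/output in double, compute in single". It does not use any BLIS API: the function name ref_gemm_d_compute_s, its argument order, and the row-major layout are all assumptions made for illustration, and the real framework expresses this through obj_t and its kernels instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine (NOT a BLIS function).
 * Computes C := beta*C + alpha*A*B, where A (m x k), B (k x n) and
 * C (m x n) are stored in double precision, row-major, but every
 * product and sum is carried out in single precision. */
void ref_gemm_d_compute_s(size_t m, size_t n, size_t k,
                          double alpha, const double *a, const double *b,
                          double beta, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            /* Accumulator lives in the computational precision (float). */
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p) {
                /* Inputs are rounded to float before being multiplied. */
                acc += (float)a[i * k + p] * (float)b[p * n + j];
            }
            /* Scaling is also done in float; only the final store
             * converts back to the storage precision (double). */
            float updated = (float)beta * (float)c[i * n + j]
                          + (float)alpha * acc;
            c[i * n + j] = (double)updated;
        }
    }
}
```

With small integer inputs the result matches a double-precision computation exactly. An input such as 1 + 2^-30, which is not representable in single precision, is rounded to 1 before it is used, which is the visible effect of choosing a lower computational precision than the storage precision.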

Compatibility:

  • Added a ScaLAPACK compatibility mode which disables some conflicting BLAS definitions. (Field Van Zee)
  • Fixed issues with improperly escaped strings in Python scripts for compatibility with Python 3.12+. (@AngryLoki)
  • Added a user-defined macro BLIS_ENABLE_STD_COMPLEX which uses std::complex typedefs in blis.h for C++ code. (Devin Matthews)
  • Fixed a bug in the definition of some scalar level-0 macros affecting compatibility of bli_creal and bli_zreal, for example. (Devin Matthews)
  • Fixed improperly-quoted strings in Python scripts which affected compatibility with Python 3.12+. (@AngryLoki)
  • The static initializer macros (BLIS_*_INITIALIZER) have been fixed for compatibility with C++. (Devin Matthews)
  • Install "helper" blis.h and cblas.h headers directly to INCDIR (in addition to the full files in INCDIR/blis). (Field Van Zee, Jed Brown, Mo Zhou)
  • gemmtr aliases for the gemmt BLAS and CBLAS compatibility functions have been added to support recent versions of LAPACK. (Mo Zhou)

Kernels:

  • Fixed an out-of-bounds read bug in the haswell gemmsup kernels. (John Mather)
  • Fixed a bug in the complex-domain gemm kernels for piledriver. (@rmast)
  • Kernel, blocksizes, and preference lookup functions now use siz_t rather than specific enums. (Devin Matthews)
  • Fixed some issues with run-time kernel detection and added more ARM part numbers/manufacturer codes. (John Mather)
  • Kernels can now be added which have two datatype parameters. Kernel IDs are assigned such that 1-type and 2-type kernels cannot be interchanged accidentally. (Devin Matthews)
  • The packing microkernels and computational microkernels (gemm and gemmtrsm) now receive offsets into the global matrix. The latter are passed via the auxinfo_t struct. (Devin Matthews)
  • The separate "MRxk" and "NRxk" packing kernels have been merged into one generic packing kernel. Packing kernels are now expected to pack any size micropanel, but may optimize for specific shapes. (Devin Matthews)
  • Added explicit packing kernels for diagonal portions of matrices, and for certain mixed-domain/1m cases. (Devin Matthews)
  • Improved support for duplication during packing ("broadcast-B") across all packing kernels.
  • Some bugs with mixed-precision/mixed-domain operations on certain architectures (esp. AVX512) have been fixed.
  • Fixed a bug affecting reference kernels with clang 14.
  • Fixed a problem affecting row/column strides of exactly -1 with gemm1m.
  • Fixed an incompatibility between the haswell gemmsup kernels and gcc 15. (Dave Love, Christopher Hillenbrand)
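
The merged packing kernel described above ("pack any size micropanel, but may optimize for specific shapes") can be pictured with the following plain-C sketch. It is only an illustration of the idea: pack_micropanel, its parameters, and the packed layout are hypothetical and do not reproduce the actual BLIS packing kernel signature, which also handles scaling, conjugation, and other cases omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical generic packing routine (NOT the BLIS kernel signature).
 * Element (i, p) of the source block is a[i*rs + p*cs]. The m x k block
 * is written to `packed` as k consecutive columns of length panel_dim,
 * where panel_dim >= m; rows m..panel_dim-1 are zero-filled so that a
 * microkernel can always read a full-height panel. */
void pack_micropanel(size_t m, size_t k, size_t panel_dim,
                     const double *a, ptrdiff_t rs, ptrdiff_t cs,
                     double *packed)
{
    assert(m <= panel_dim);

    for (size_t p = 0; p < k; ++p) {
        const double *src = a + (ptrdiff_t)p * cs;
        double *dst = packed + p * panel_dim;

        if (m == panel_dim && rs == 1) {
            /* Fast path for one specific shape: a full-height panel
             * whose rows are contiguous in memory. */
            memcpy(dst, src, panel_dim * sizeof *dst);
        } else {
            /* General path: any height, any (possibly negative) stride. */
            for (size_t i = 0; i < m; ++i)
                dst[i] = src[(ptrdiff_t)i * rs];
            for (size_t i = m; i < panel_dim; ++i)
                dst[i] = 0.0;
        }
    }
}
```

One routine covers both the old "MRxk" and "NRxk" roles simply by being called with a different panel_dim, while the fast path shows where a shape-specific optimization would slot in.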

Build system:

  • The cblas.h file is now "flattened" immediately after blis.h is (if enabled), rather than later in the build process. (Jeff Diamond, Field Van Zee)
  • Added script to help with preparing release candidate branches. (Field Van Zee)
  • The configure script has been overhauled. In particular, using spaces in CC/CXX is now supported. (Devin Matthews)
  • Improved support for C++ source files in BLIS or in plugins. (Devin Matthews)
  • Disabled armsve on Windows due to build failures. (Hernan Martinez, Atsushi Tatsuma)
  • Added integer BLIS_VERSION_{MAJOR,MINOR,REVISION} macros to blis.h so that users can check BLIS version compatibility through the C preprocessor.
  • Moved #include <omp.h> from blis.h to the relevant source files. (Melven Roehrig-Zoellner)
  • Disabled building KNL with gcc 15. (Dave Love)
  • Improved support for the Intel and NVIDIA Fortran compilers (ifx and nvfortran), particularly in terms of selecting the correct method for returning complex numbers. (Jeff Hammond)
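
As a sketch of how the integer BLIS_VERSION_{MAJOR,MINOR,REVISION} macros mentioned above might be used, the following C fragment gates application code on the BLIS version. Only the three BLIS_VERSION_* names come from the release notes; the BLIS_VERSION_AT_LEAST helper, the HAVE_BLIS_PLUGINS flag, and the have_blis_plugins function are hypothetical names defined by the example itself. The stand-in definitions exist only so the fragment compiles without blis.h; real code would include blis.h instead.

```c
#include <assert.h>

/* Real code would `#include "blis.h"` here. The stand-in values below
 * (an assumption for this sketch) are used only if blis.h is absent. */
#ifndef BLIS_VERSION_MAJOR
#define BLIS_VERSION_MAJOR    2
#define BLIS_VERSION_MINOR    0
#define BLIS_VERSION_REVISION 0
#endif

/* Hypothetical helper (not provided by BLIS): true if the BLIS version
 * is at least maj.min.rev. Usable in both #if and ordinary expressions. */
#define BLIS_VERSION_AT_LEAST(maj, min, rev)                          \
    ((BLIS_VERSION_MAJOR > (maj)) ||                                  \
     (BLIS_VERSION_MAJOR == (maj) && BLIS_VERSION_MINOR > (min)) ||   \
     (BLIS_VERSION_MAJOR == (maj) && BLIS_VERSION_MINOR == (min) &&   \
      BLIS_VERSION_REVISION >= (rev)))

/* Soft dependency: enable an optional code path on new enough versions. */
#if BLIS_VERSION_AT_LEAST(2, 0, 0)
#define HAVE_BLIS_PLUGINS 1   /* hypothetical application-side flag */
#else
#define HAVE_BLIS_PLUGINS 0
#endif

/* Hard dependency: refuse to compile against anything older than 1.0.0. */
static_assert(BLIS_VERSION_AT_LEAST(1, 0, 0),
              "this code requires BLIS 1.0.0 or newer");

/* Run-time view of the compile-time decision above. */
int have_blis_plugins(void)
{
    return HAVE_BLIS_PLUGINS;
}
```

Because the macros are plain integers, the comparison is resolved entirely by the preprocessor and adds no run-time cost.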

Testing:

  • test/3 drivers now allow using the "default" induced method, rather than forcing native or 1m operation. (Field Van Zee, Leick Robinson)
  • Fixed some segfaults in the test/3 drivers. (Field Van Zee, Leick Robinson)
  • The testsuite now tests all possible type combinations when requested. (Devin Matthews)
  • Improved detection of problems in make check-blis and related targets. (Devin Matthews)
  • CI testing infrastructure has moved to CircleCI.

Documentation:

  • Added documentation for the new plugin system and for creating custom operations by modifying the BLIS control tree. (Devin Matthews)
  • Updated documentation for downloading BLIS in README.md and instructions for maintainers in RELEASING. (Field Van Zee)
  • Widened print format in code examples to avoid misinterpretation of results. (Minh Quan Ho, Mason McBride)