mikerabat edited this page Jan 10, 2026 · 1 revision

Global Variables

There are a number of global variables and utility functions that are initialized on startup and can be adjusted during usage.

Block sizes

Internally the multiplication algorithm and the matrix decomposition routines do not operate on single rows and columns but rather on blocks, which allows cache friendly algorithms and introduces a major speedup. The default block sizes are a good tradeoff between older and newer CPU capabilities. Nevertheless one can tweak these numbers to get optimal runtimes for a given CPU.

These block sizes are defined in the file BlockSizeSetup.pas:

  • BlockMatrixCacheSize: The block size the matrix multiplication is split into. Typically a 128x128 block is used in the inner loop.
  • BlockedMatrixMultSize: The transition size from a direct to a blocked multiplication approach - typically 512. It is higher than BlockMatrixCacheSize since the blocked multiplication carries some overhead.
  • BlockedVectorMatrixMultSize: Block size for a matrix-vector multiplication.
  • QRBlockSize: Block size used in the QR decomposition. The routine typically operates on blocks of 32.
  • QRMultBlockSize: The default multiplication block size used in the QR decomposition. Typically this is smaller (128) than the general multiplication block size since the QR kernels already occupy part of the cache.
  • CholBlockSize: Cholesky decomposition block size. The routine typically operates on blocks of 32 rows.
  • SVDBlockSize: Block size the SVD kernel operates on. Typically this routine operates on blocks of 32 rows.
  • HessBlockSize: Block size the Hessenberg decomposition kernel operates on. Typically this routine operates on blocks of 32 items.
  • HessMultBlockSize: Multiplication block size used in the Hessenberg decomposition routine. Typically this is a 128x128 block.
  • SymEigBlockSize: The kernel used in the symmetric eigenvalue calculation typically uses a block size of 32 rows.

Note: for optimal performance these block sizes should be powers of 2, or at least multiples of 16 - this optimizes the branching in the internal assembler routines.

There are two (crude) procedures implemented in BlockSizeSetup.pas, SetupOptBlockMatrixSize and SetupBlockedMatrixMultSize, which demonstrate a basic way to determine the optimal matrix multiplication block size.
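As a hedged sketch (assuming the library's units are on the search path), the calibration procedures can be run once at startup; since the block sizes are plain global variables, they can presumably also be set directly, keeping the power-of-2 recommendation above in mind:

```pascal
uses BlockSizeSetup;

procedure TuneBlockSizes;
begin
  // Run the (crude) calibration routines once at program startup.
  // They probe multiplication runtimes and adjust the globals accordingly.
  SetupOptBlockMatrixSize;
  SetupBlockedMatrixMultSize;

  // Alternatively, override the globals directly - keep them powers of 2
  // or at least multiples of 16 as recommended above.
  BlockMatrixCacheSize := 256;
  BlockedMatrixMultSize := 1024;
end;
```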

Assembler instruction sets

The unit MatrixASMStubSwitch.pas is the main unit that implements stubs for the simple matrix operations. The unit can switch the routines between different CPU instruction sets, e.g. AVX, SSE and FMA, or plain Pascal code (which uses x87 opcodes on x86 platforms and SSE on x64 platforms).

The functions:

   type
     TCPUInstrType = (itFPU, itSSE, itAVX, itFMA);

   procedure InitMathFunctions(instrType : TCPUInstrType; useStrassenMult : boolean);
   procedure InitSSEOptFunctions(instrType : TCPUInstrType);
   function GetCurCPUInstrType : TCPUInstrType;

in this unit can be used to select the instruction set. By default the unit tries to use the "highest" available instruction set (itFMA) for optimal performance. InitMathFunctions is the main function that switches between instruction sets. The useStrassenMult option selects the recursive Strassen implementation of matrix multiplication, which is theoretically faster than the standard multiplication method.

GetCurCPUInstrType returns the currently used instruction set.

Note that these functions have no effect when targeting ARM or when the define MRMATH_NOASM is set.
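A minimal sketch (assuming the unit is on the search path) of forcing a specific instruction set and verifying the switch:

```pascal
uses MatrixASMStubSwitch;

begin
  // Force plain SSE kernels with the standard (non-Strassen) multiplication.
  InitMathFunctions(itSSE, False);

  // Check which instruction set is now active.
  if GetCurCPUInstrType = itSSE then
    WriteLn('SSE kernels active');
end;
```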

The unit CPUFeatures.pas implements a few routines for x86 and x64 targets that allow determining what the current CPU is capable of:

   function IsSSE3Present : boolean;
   function IsAVXPresent : boolean;
   function IsAVX512Present : boolean;
   function IsFMAPresent : boolean;
   function IsHardwareRNDSupport : boolean;
   function IsHardwareRDSeed : boolean;

   function GetCurrentProcessorNumber : LongWord; register;

The routines are mostly self-explanatory but can only be used in x86/x64 environments - on ARM these routines fail! The routine GetCurrentProcessorNumber can be used to determine on which CPU core the current thread is running.
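These checks can be combined with InitMathFunctions to select the best supported instruction set by hand; a hedged sketch (note that on x86/x64 the library already defaults to the highest available set, so this is only needed for explicit control):

```pascal
uses CPUFeatures, MatrixASMStubSwitch;

procedure SelectBestInstrSet;
begin
  // Walk down from the richest to the plainest instruction set.
  if IsFMAPresent then
    InitMathFunctions(itFMA, False)
  else if IsAVXPresent then
    InitMathFunctions(itAVX, False)
  else if IsSSE3Present then
    InitMathFunctions(itSSE, False)
  else
    InitMathFunctions(itFPU, False);
end;
```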

Multithreading

The routines here can use at most 64 threads. The unit MtxThreadPool.pas contains a few global variables that control the multithreading routines. Note that modern CPUs have distinct efficiency and performance cores. One can set the number of threads used by the multithreading routines so that only the performance cores are used internally - the reduced number of cores can actually improve overall performance, since many decomposition routines contain a waiting barrier, e.g. around the internally used multiplications.

   const cMaxNumCores = 64;                          // limit the maximum usable cores

   type
     TNumCoreOpt = (coAll, coRealCores, coPerformanceCores);
   procedure InitNumUsedCores( optType : TNumCoreOpt );

   var numUseCPUCores : NativeInt = 0;
       numCPUCores : NativeInt = 0;
       numRealCores : NativeInt = 0;             // cores without hyperthreading
       numCoresForSimpleFuncs : NativeInt = 0;   // for median and scaling operations
       numPCores : NativeInt = 0;                // performance cores
       numECores : NativeInt = 0;                // efficiency cores

  • numUseCPUCores: The number of threads used internally in various routines, e.g. multiplication, QR, SVD...
  • numCPUCores: The number of threads available on the CPU.
  • numRealCores: Number of real independent cores; hyperthreading not included.
  • numCoresForSimpleFuncs: The number of threads used in Add/Scale and Sort/Median functions.
  • numPCores, numECores: The number of performance and efficiency cores on the system. One may want to restrict the used cores to the performance cores only for optimal efficiency.

These variables are set on all platforms, but the P and E core counts can only be determined on Windows. On the other platforms the numbers are the same.
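A sketch of restricting the thread pool to performance cores (assuming this runs at startup, before any threaded routines are invoked):

```pascal
uses MtxThreadPool;

begin
  // Use only the performance cores for the internally threaded routines.
  InitNumUsedCores(coPerformanceCores);
  WriteLn('Threads used: ', numUseCPUCores, ' of ', numCPUCores);
end;
```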
