Releases: lattice/quda
QUDA v1.1.0
Version 1.1.0 - October 2021
-
Add support for NVSHMEM communication for the Dslash operators, for significantly improved strong scaling. See https://github.com/lattice/quda/wiki/Multi-GPU-with-NVSHMEM for more details.
-
Addition of the MSPCG preconditioned CG solver for Möbius fermions. See https://github.com/lattice/quda/wiki/The-Multi-Splitting-Preconditioned-Conjugate-Gradient-(MSPCG),-an-application-of-the-additive-Schwarz-Method for more details.
-
Addition of the Exact One Flavor Algorithm (EOFA) for Möbius fermions. See https://github.com/lattice/quda/wiki/The-Exact-One-Flavor-Algorithm-(EOFA) for more details.
-
Addition of a fully GPU native Implicitly Restarted Arnoldi eigensolver (as opposed to partially relying on ARPACK). See https://github.com/lattice/quda/wiki/QUDA%27s-eigensolvers#implicitly-restarted-arnoldi-eigensolver for more details.
-
Significantly reduced latency for reduction kernels through the use of heterogeneous atomics. Requires CUDA 11.0+.
-
Addition of support for a split-grid multi-RHS solver. See https://github.com/lattice/quda/wiki/Split-Grid for more details.
-
Continued work on enhancing and refining the staggered multigrid algorithm. The MILC interface can now drive the staggered multigrid solver.
-
Multigrid setup can now use tensor cores on Volta, Turing and Ampere GPUs to accelerate the calculation. Enable with the QudaMultigridParam::use_mma parameter.
-
Improved support of managed memory through the addition of a prefetch API. This can dramatically improve the performance of the multigrid setup when oversubscribing the memory.
-
Improved the performance of MILC RHMC when using QUDA.
-
Add support for a new internal data order FLOAT8. This is the default data order for nSpin=4 half and quarter precision fields, though the prior FLOAT4 order can be enabled with the cmake option QUDA_FLOAT8=OFF.
-
Removal of the singularity from the reconstruct-8 and reconstruct-9 compressed gauge field ordering. This enables support for free fields with these orderings.
-
The clover parameter convention has been codified: one can either
1.) pass in QudaInvertParam::kappa and QudaInvertParam::csw separately, and QUDA will infer the necessary clover coefficient, or
2.) pass an explicit value of QudaInvertParam::clover_coeff (e.g. CHROMA's use case) and that will override the above inference.
-
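The clover parameter convention in the preceding item can be illustrated with a minimal sketch. The CloverParams struct and the effective_clover_coeff function are hypothetical stand-ins, not QUDA's API. The sketch assumes that the inferred coefficient is the product kappa * csw and that a value of zero means "not set"; neither assumption is stated in these notes.

```cpp
// Hypothetical stand-in for the three QudaInvertParam fields named above.
struct CloverParams {
  double kappa = 0.0;
  double csw = 0.0;
  double clover_coeff = 0.0; // assumption: 0.0 means "not explicitly set"
};

// Returns the coefficient that would be used under the stated convention.
// Assumption: the inferred value is kappa * csw.
double effective_clover_coeff(const CloverParams &p)
{
  if (p.clover_coeff != 0.0) return p.clover_coeff; // option 2: explicit value overrides
  return p.kappa * p.csw;                           // option 1: inferred from kappa and csw
}
```

With kappa = 0.125 and csw = 2.0 the sketch returns 0.25; setting clover_coeff explicitly makes that value take precedence.
-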
QUDA now includes fast-compilation options (QUDA_FAST_COMPILE_DSLASH and QUDA_FAST_COMPILE_REDUCE) which enable much faster build times for development at the expense of reduced performance.
-
Add support for compiling QUDA using clang for both the host and device compiler.
-
While the bulk of the work associated with making QUDA portable to different architectures will form the soul of QUDA 2.0, some of the initial refactoring associated with this has been applied.
-
Significant cleanup of the tests directory to reduce boilerplate.
-
General improvements to the cmake build system using modern cmake features. We now require cmake 3.15.
-
Extended the ctest list to include some optional benchmarks.
-
Fix a long-standing issue with multi-node Kepler GPU and Intel dual socket systems.
-
Improved ASAN integration: SANITIZE builds now work out of the box with no need to set the ASAN_OPTIONS environment variable.
-
Add support for the extended QIO branch (now required for MILC).
-
Bump QMP version to 2.5.3.
-
Updated to Eigen 3.3.9.
-
Multiple bug fixes and clean up to the library. Many of these are listed here: https://github.com/lattice/quda/milestone/24?closed=1
QUDA v1.0.0
Version 1.0.0 - 10 January 2020
-
Add support for CUDA 10.2: QUDA 1.0.0 is supported on CUDA 7.5-10.2
using either GCC or clang compilers. CUDA 10.x and either GCC >=
6.x or clang >= 6.x are highly recommended.
-
Significant improvements to the CMake build system and removal of the
legacy configure build.
-
Added more targeted compilation options to constrain which
precisions and reconstruct types are compiled. QUDA_PRECISION is a
cmake parameter that is a 4-bit number corresponding to which
precisions are enabled, with 1 = quarter, 2 = half, 4 = single and
8 = double; the default is 14, which enables double, single and half
precision. QUDA_RECONSTRUCT is a 3-bit number corresponding to
which reconstruct types are enabled, with 1 = reconstruct-8/9,
2 = reconstruct-12/13 and 4 = reconstruct-18; the default is 7, which
enables all reconstruct types.
-
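The QUDA_PRECISION and QUDA_RECONSTRUCT bit masks from the preceding item can be decoded as in the following minimal sketch. The function names are hypothetical helpers for illustration and are not part of QUDA; only the bit values are taken from the text.

```cpp
#include <string>
#include <vector>

// Decode a QUDA_PRECISION value using the bit assignments given above:
// 1 = quarter, 2 = half, 4 = single, 8 = double.
std::vector<std::string> enabled_precisions(unsigned mask)
{
  std::vector<std::string> out;
  if (mask & 8u) out.push_back("double");
  if (mask & 4u) out.push_back("single");
  if (mask & 2u) out.push_back("half");
  if (mask & 1u) out.push_back("quarter");
  return out;
}

// Decode a QUDA_RECONSTRUCT value using the bit assignments given above:
// 1 = reconstruct-8/9, 2 = reconstruct-12/13, 4 = reconstruct-18.
std::vector<std::string> enabled_reconstructs(unsigned mask)
{
  std::vector<std::string> out;
  if (mask & 4u) out.push_back("reconstruct-18");
  if (mask & 2u) out.push_back("reconstruct-12/13");
  if (mask & 1u) out.push_back("reconstruct-8/9");
  return out;
}
```

For the default QUDA_PRECISION of 14 (8 + 4 + 2) the first helper lists double, single and half; for the default QUDA_RECONSTRUCT of 7 the second lists all three reconstruct types.
-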
Completely rewrote all dslash kernels using the accessor
framework. This dramatically reduces code complexity and improves
performance.
-
New physics functionality added: gauge Laplace kernel, Gaussian
quark smearing, topological charge density.
-
QUDA can now be built either to utilize texture-memory reads or to
use direct memory access (cmake option QUDA_TEX). The default
has textures on, though we note that since Pascal it can be
advantageous to disable textures and utilize direct reads.
-
QUDA is no longer supported on the Fermi generation of GPUs (sm_20
and sm_21). Compilation and running should still be possible but
will require compilation with texture objects disabled.
-
Added support for quarter precision (QUDA_QUARTER_PRECISION) for
the linear operator and associated solvers.
-
Implemented both CA-CG and CA-GCR communication-avoiding solvers, for
use either as stand-alone solvers or as a means to accelerate
multigrid.
-
Continued evolution and optimization of the multigrid framework.
Regardless, we advise users to use the latest develop branch when
using multigrid, since it continues to be a fast-moving target with
continual focus on optimization and improvement.
-
An implementation of the Thick Restarted Lanczos Method (TRLM) for
eigenvector solving of the normal operator.
-
Lanczos-accelerated multigrid through the use of coarse-grid
deflation and/or using singular vectors to define the prolongator.
-
Removal of the legacy contraction and covariant derivative
algorithms, and replacement with accessor-based rewrites.
-
Improved heavy-quark residual convergence, which ensures correct
convergence for MILC heavy quark observables.
-
Experimental support for Just-In-Time (JIT) compilation using Jitify.
-
Significantly improved unit testing framework using ctest.
-
QUDA can now be built to target Google's address sanitizer
(CMAKE_BUILD_TYPE option is SANITIZE) for improved debugging.
-
QUDA can now download and install the USQCD libraries QMP and QIO
automatically as part of the compilation process. To enable this,
the option QUDA_DOWNLOAD_USQCD=ON should be set. As with the Eigen
installation, this requires internet access.
-
QUDA can now download and install the ARPACK library automatically
if the QUDA_DOWNLOAD_ARPACK option is enabled.
-
Updated to CUB 1.8.
-
Multiple bug fixes and clean up to the library. Many of these are
listed here: https://github.com/lattice/quda/milestone/21?closed=1
QUDA v0.9.0
Version 0.9.0 - 24 July 2018
-
Add support for CUDA 9.x: QUDA 0.9.0 is supported on CUDA 7.0-9.2.
-
Continued focus on optimization of multi-GPU execution, with
particular emphasis on Dslash scaling. For more details on
optimizing multi-GPU performance, see
https://github.com/lattice/quda/wiki/Multi-GPU-Support
-
On systems that support it, QUDA now uses direct peer-to-peer
communication between GPUs within the same node. The Dslash policy
autotuner will ascertain the optimal communication route to take,
whether it be to route through CPU memory, use DMA copy engines or
directly write the halo buffer to neighboring GPUs.
-
On systems that support it, QUDA will take advantage of GPU Direct
RDMA. This is enabled through setting the environment variable
QUDA_ENABLE_GDR=1 which will augment the dslash tuning policies to
include policies using GPU-aware MPI to facilitate direct GPU-NIC
communication. This can improve strong scaling by up to 3x.
-
Improved precision when using half precision (use rounding instead
of truncation when converting to/from float).
-
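The difference between rounding and truncation noted in the preceding item can be sketched as follows. This is an illustration only: it assumes a signed 16-bit fixed-point representation with a scale factor of 32767 for values in [-1, 1], and the function names are hypothetical rather than QUDA's own conversion routines.

```cpp
#include <cmath>
#include <cstdint>

// Assumed scale: maps [-1, 1] onto the signed 16-bit range.
constexpr float kScale = 32767.0f;

// Truncating conversion: the fractional part is simply discarded.
std::int16_t to_fixed_truncate(float x) { return static_cast<std::int16_t>(x * kScale); }

// Rounding conversion: the nearest representable value is chosen.
std::int16_t to_fixed_round(float x) { return static_cast<std::int16_t>(std::lround(x * kScale)); }

// Conversion back to float.
float from_fixed(std::int16_t v) { return static_cast<float>(v) / kScale; }
```

For x = 0.7f the scaled value is roughly 22936.9, so truncation stores 22936 and rounding stores 22937; the rounded value converts back to a number closer to 0.7.
-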
Add support for symmetric preconditioning for 4-d preconditioned
Shamir and Mobius Dirac operators.
-
Added initial support for multi-right-hand-side staggered Dirac
operator (treat the rhs index as a fifth dimension).
-
Added initial implementation of block CG linear solver.
-
Added BiCGStab(l) linear solver. The parameter "l" corresponds to
the size of the space to perform GCR-style residual minimization.
This is typically much better behaved than BiCGStab for the Wilson
and Wilson-clover linear systems.
-
Initial version of adaptive multigrid fully implemented into QUDA.
-
Creation of a multi-blas and multi-reduction framework. This is
essential for high performance in pipelined, block and
communication-avoiding solvers that work on "matrices of vectors" as
opposed to "scalars of vectors". The maximum tile size used by the
multi-blas framework is set by the QUDA_MAX_MULTI_BLAS_N cmake
parameter, which defaults to 4 for reduced compile time. For
production use of such solvers, this should be increased to between 8 and 16.
-
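One reading of the "matrices of vectors" idea in the preceding item is a block update in which a matrix of coefficients couples a set of input vectors to a set of output vectors in a single pass. The following CPU sketch illustrates that reading; block_axpy is a hypothetical name and this is not QUDA's multi-blas implementation.

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Computes y[j] += sum_i a[i][j] * x[i] for every output vector y[j].
// Each vector element k is visited once, and all coefficient pairs (i, j)
// are applied while that element is in hand, which is the fusion a tiled
// multi-blas kernel aims for.
void block_axpy(const std::vector<std::vector<double>> &a,
                const std::vector<Vec> &x, std::vector<Vec> &y)
{
  const std::size_t n = x.empty() ? 0 : x[0].size();
  for (std::size_t k = 0; k < n; ++k)
    for (std::size_t j = 0; j < y.size(); ++j)
      for (std::size_t i = 0; i < x.size(); ++i)
        y[j][k] += a[i][j] * x[i][k];
}
```

Calling an ordinary axpy once per (i, j) pair would give the same result but would stream every vector through memory many times, which is the cost a fused kernel avoids.
-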
Optimization of multi-shift solver using multi-blas framework to permit
kernel fusion of all shift updates.
-
Complete rewrite and optimization of clover inversion, HISQ force
kernels, HISQ link fattening algorithms using accessors.
-
QUDA can now directly load/store from MILC's site structure array.
This removes the need to unpack and pack data prior to calling QUDA,
and dramatically reduces CPU overhead.
-
Removal of legacy data structures and kernels. In particular, the
original single-GPU-only ASQTAD fermion force has been removed.
-
Implementation of STOUT fattening kernel.
-
Significant improvement to the cmake build system to improve
compilation speed and aid productivity. In particular, QUDA now
supports being built as a shared library which greatly reduces link
time.
-
The Autoconf and configure build system is no longer supported.
-
Automated unit testing of dslash_test and blas_test is now enabled
using ctest.
-
Add support for MPS, enabled through setting the environment
variable QUDA_ENABLE_MPS=1. This allows GPUs to be oversubscribed by
multiple processes, which can improve overall job throughput.
-
Implemented a self-profiler that builds on top of the autotuning
framework. The kernel profile is output to profile_n.tsv, where n starts
at 0 and is incremented with each call to saveProfile (which dumps the
profile to disk). An equivalent algorithm policy profile is output
to profile_async_n.tsv, which contains policies such as a complete
dslash. The filename prefix and path can be overridden using the
QUDA_PROFILE_OUTPUT_BASE environment variable.
-
Implemented a simple tracing facility that dumps the flow of kernels
called through a single execution to trace.tsv. Enabled with the
environment variable QUDA_ENABLE_TRACE=1.
-
Multiple bug fixes and clean up to the library. Many of these are
listed here: https://github.com/lattice/quda/milestone/15?closed=1
Pre-release 0.9 with old MILC interface
QUDA v0.9.0 will introduce a new MILC interface. The development version of MILC at https://github.com/milc-qcd/milc_qcd already uses the new interface.
This version is solely for backwards compatibility. It has been tested using a limited test set.