Version 1.1.0 - October 2021
- Add support for NVSHMEM communication for the Dslash operators, for
significantly improved strong scaling. See
https://github.com/lattice/quda/wiki/Multi-GPU-with-NVSHMEM for more
details.
- Addition of the MSPCG preconditioned CG solver for Möbius
fermions. See
https://github.com/lattice/quda/wiki/The-Multi-Splitting-Preconditioned-Conjugate-Gradient-(MSPCG),-an-application-of-the-additive-Schwarz-Method
for more details.
- Addition of the Exact One Flavor Algorithm (EOFA) for Möbius
fermions. See
https://github.com/lattice/quda/wiki/The-Exact-One-Flavor-Algorithm-(EOFA)
for more details.
- Addition of a fully GPU native Implicitly Restarted Arnoldi
eigensolver (as opposed to partially relying on ARPACK). See
https://github.com/lattice/quda/wiki/QUDA%27s-eigensolvers#implicitly-restarted-arnoldi-eigensolver
for more details.
- Significantly reduced latency for reduction kernels through the use
of heterogeneous atomics. Requires CUDA 11.0+.
- Addition of support for a split-grid multi-RHS solver. See
https://github.com/lattice/quda/wiki/Split-Grid for more details.
- Continued work on enhancing and refining the staggered multigrid
algorithm. The MILC interface can now drive the staggered multigrid
solver.
- Multigrid setup can now use tensor cores on Volta, Turing and Ampere
GPUs to accelerate the calculation. Enable with the
`QudaMultigridParam::use_mma` parameter.
- Improved support of managed memory through the addition of a
prefetch API. This can dramatically improve the performance of the
multigrid setup when oversubscribing the memory.
- Improved the performance of MILC RHMC when run with QUDA.
- Add support for a new internal data order FLOAT8. This is the
default data order for nSpin=4 half and quarter precision fields,
though the prior FLOAT4 order can be enabled with the cmake option
QUDA_FLOAT8=OFF.
- Removed the singularity from the reconstruct-8 and reconstruct-9
compressed gauge field orderings. This enables support for free
fields with these orderings.
- The clover parameter convention has been codified: one can either
1.) pass in QudaInvertParam::kappa and QudaInvertParam::csw
separately, and QUDA will infer the necessary clover coefficient, or
2.) pass an explicit value of QudaInvertParam::clover_coeff
(e.g. CHROMA's use case), which overrides the above inference (see
the sketch below this list).
- QUDA now includes fast-compilation options (QUDA_FAST_COMPILE_DSLASH
and QUDA_FAST_COMPILE_REDUCE) which enable much faster build times
for development at the expense of reduced performance.
- Add support for compiling QUDA using clang for both the host and
device compiler.
- While the bulk of the work associated with making QUDA portable to
different architectures will form the soul of QUDA 2.0, some of the
initial refactoring associated with this has been applied.
- Significant cleanup of the tests directory to reduce boilerplate.
- General improvements to the cmake build system using modern cmake
features. We now require cmake 3.15.
- Extended the ctest list to include some optional benchmarks.
- Fixed a long-standing issue affecting multi-node runs with Kepler
GPUs on dual-socket Intel systems.
- Improved ASAN integration: SANITIZE builds now work out of the box
with no need to set the ASAN_OPTIONS environment variable.
- Add support for the extended QIO branch (now required for MILC).
- Bump QMP version to 2.5.3.
- Updated to Eigen 3.3.9.
- Multiple bug fixes and clean up to the library. Many of these are
listed here: https://github.com/lattice/quda/milestone/24?closed=1
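The following is a minimal illustrative sketch (not taken from the QUDA
sources) of the two clover conventions noted above; the numerical values
are placeholders and only the relevant QudaInvertParam members are shown.

    #include <quda.h>

    /* Illustrative sketch only: the two clover coefficient conventions. */
    void set_clover_convention(QudaInvertParam *inv_param)
    {
      /* Convention 1: pass kappa and csw separately and let QUDA infer
         the clover coefficient (placeholder values). */
      inv_param->kappa = 0.124;
      inv_param->csw   = 1.17;

      /* Convention 2: pass an explicit coefficient (e.g. CHROMA's use
         case); this overrides the inference above. */
      inv_param->clover_coeff = 0.124 * 1.17;
    }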
Version 1.0.0 - 10 January 2020
- Add support for CUDA 10.2: QUDA 1.0.0 is supported on CUDA 7.5-10.2
using either GCC or clang compilers. CUDA 10.x and either GCC >=
6.x or clang >= 6.x are highly recommended.
- Significant improvements to the CMake build system and removal of the
legacy configure build.
- Added more targeted compilation options to constrain which
precisions and reconstruct types are compiled. QUDA_PRECISION is a
cmake parameter given as a 4-bit number indicating which precisions
are enabled, with 1 = quarter, 2 = half, 4 = single and 8 = double;
the default is 14, which enables double, single and half precision.
QUDA_RECONSTRUCT is a 3-bit number indicating which reconstruct
types are enabled, with 1 = reconstruct-8/9, 2 = reconstruct-12/13
and 4 = reconstruct-18; the default is 7, which enables all
reconstruct types (see the worked example below this list).
- All dslash kernels have been completely rewritten using the accessor
framework. This dramatically reduces code complexity and improves
performance.
- New physics functionality added: gauge Laplace kernel, Gaussian
quark smearing, topological charge density.
- QUDA can now be built to either utilize texture-memory reads or to
use direct memory accessing (cmake option QUDA_TEX). The default
has textures on, though we note that since Pascal it can be
advantageous to disable textures and utilize direct reads.
- QUDA is no longer supported on the Fermi generation of GPUs (sm_20
and sm_21). Compilation and running should still be possible but
will require compilation with texture objects disabled.
- Added support for quarter precision (QUDA_QUARTER_PRECISION) for
the linear operator and associated solvers.
- Implemented both the CA-CG and CA-GCR communication-avoiding solvers,
for use either as stand-alone solvers or as a means to accelerate
multigrid.
- Continued evolution and optimization of the multigrid framework.
Regardless, we advise users to use the latest develop branch when
using multigrid, since it continues to be a fast-moving target with
continual focus on optimization and improvement.
- An implementation of the Thick Restarted Lanczos Method (TRLM) for
computing eigenvectors of the normal operator.
- Lanczos-accelerated multigrid through the use of coarse-grid
deflation and/or using singular vectors to define the prolongator.
- Removal of the legacy contraction and covariant derivative
algorithms, and their replacement with accessor-based rewrites.
- Improved heavy-quark residual convergence, which ensures correct
convergence for MILC heavy-quark observables.
- Experimental support for Just-In-Time (JIT) compilation using Jitify.
- Significantly improved unit testing framework using ctest.
- QUDA can now be built to target Google's address sanitizer
(CMAKE_BUILD_TYPE option is SANITIZE) for improved debugging.
- QUDA can now download and install the USQCD libraries QMP and QIO
automatically as part of the compilation process. To enable this,
the option QUDA_DOWNLOAD_USQCD=ON should be set. As with the Eigen
installation, this requires access to the outside internet.
- QUDA can now download and install the ARPACK library automatically
if the QUDA_DOWNLOAD_ARPACK option is enabled.
- Updated to CUB 1.8.
- Multiple bug fixes and clean up to the library. Many of these are
listed here: https://github.com/lattice/quda/milestone/21?closed=1
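As a worked illustration of the bitmask conventions above (these are
cmake-time options, not runtime settings; the enum names below are
purely illustrative and not part of QUDA):

    /* Illustrative only: how the QUDA_PRECISION and QUDA_RECONSTRUCT
       bitmask values are composed. */
    enum { PREC_QUARTER = 1, PREC_HALF = 2, PREC_SINGLE = 4, PREC_DOUBLE = 8 };
    enum { RECON_8_9 = 1, RECON_12_13 = 2, RECON_18 = 4 };

    static const int quda_precision   = PREC_DOUBLE | PREC_SINGLE | PREC_HALF; /* = 14, the default */
    static const int quda_reconstruct = RECON_8_9 | RECON_12_13 | RECON_18;    /* =  7, the default */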
Version 0.9.0 - 24 July 2018
- Add support for CUDA 9.x: QUDA 0.9.0 is supported on CUDA 7.0-9.2.
- Continued focus on optimization of multi-GPU execution, with
particular emphasis on Dslash scaling. For more details on
optimizing multi-GPU performance, see
https://github.com/lattice/quda/wiki/Multi-GPU-Support
- On systems that support it, QUDA now uses direct peer-to-peer
communication between GPUs within the same node. The Dslash policy
autotuner will ascertain the optimal communication route to take,
whether it be to route through CPU memory, use DMA copy engines or
directly write the halo buffer to neighboring GPUs.
- On systems that support it, QUDA will take advantage of GPU Direct
RDMA. This is enabled through setting the environment variable
QUDA_ENABLE_GDR=1 which will augment the dslash tuning policies to
include policies using GPU-aware MPI to facilitate direct GPU-NIC
communication. This can improve strong scaling by up to 3x.
- Improved precision when using half precision (use rounding instead
of truncation when converting to/from float).
- Add support for symmetric preconditioning for 4-d preconditioned
Shamir and Möbius Dirac operators.
- Added initial support for a multi-right-hand-side staggered Dirac
operator (treating the rhs index as a fifth dimension).
- Added initial implementation of block CG linear solver.
- Added BiCGStab(l) linear solver. The parameter "l" corresponds to
the size of the space to perform GCR-style residual minimization.
This is typically much better behaved than BiCGStab for the Wilson
and Wilson-clover linear systems.
- Initial version of adaptive multigrid fully implemented into QUDA.
- Creation of the multi-blas and multi-reduction framework. This is
essential for high performance in pipelined, block and
communication-avoiding solvers that work on "matrices of vectors" as
opposed to "scalars of vectors". The max tile size used by the
multi-blas framework is set by the QUDA_MAX_MULTI_BLAS_N cmake
parameter, which defaults to 4 for reduced compile time. For
production use of such solvers, this should be increased to 8-16.
- Optimization of multi-shift solver using multi-blas framework to permit
kernel fusion of all shift updates.
- Complete rewrite and optimization of the clover inversion, HISQ force
kernels and HISQ link fattening algorithms using accessors.
- QUDA can now directly load/store from MILC's site structure array.
This removes the need to unpack and pack data prior to calling QUDA,
and dramatically reduces CPU overhead.
- Removal of legacy data structures and kernels. In particular, the
original single-GPU-only ASQTAD fermion force has been removed.
- Implementation of STOUT fattening kernel.
- Significant improvement to the cmake build system to improve
compilation speed and aid productivity. In particular, QUDA now
supports being built as a shared library which greatly reduces link
time.
- Autoconf and configure build system is no longer supported.
- Automated unit testing of dslash_test and blas_test are now enabled
using ctest.
- Adds support for MPS, enabled through setting the environment
variable QUDA_ENABLE_MPS=1. This allows GPUs to be oversubscribed by
multiple processes, which can improve overall job throughput.
- Implemented self-profiler that builds on top of autotuning
framework. Kernel profile is output to profile_n.tsv, where n=0,
with n incremented with each call to saveProfile (which dumps the
profile to disk). An equivalent algorithm policy profile is output
to profile_async_n.tsv which contains policies such as a complete
dslash. The filename prefix and path can be overridden using the
QUDA_PROFILE_OUTPUT_BASE environment variable.
- Implemented simple tracing facility that dumps the flow of kernels
called through a single execution to trace.tsv. Enabled with
environment variable QUDA_ENABLE_TRACE=1.
- Multiple bug fixes and clean up to the library. Many of these are
listed here: https://github.com/lattice/quda/milestone/15?closed=1
Version 0.8.0 - 1st February 2016
- Removed all Tesla-generation GPU support from QUDA (sm_1x). As a
result, QUDA now requires a minimum of the Fermi-generation GPUs.
- Added support for building QUDA using cmake. This gives a much more
flexible and extensible build system as well as allowing
out-of-source-directory building. For details see:
https://github.com/lattice/quda/wiki/Building-QUDA-with-cmake
- Improved strong scaling of the multi-shift solver by overlapping the
shift updates with the communication wait of the subsequent
iteration's dslash.
- Improved performance of multi-shift solver by preventing unnecessary
refinement of shifted solutions once the residual falls below
floating point precision.
- Significantly improved performance of FloatNOrder accessor functors
to ensure vectorized memory accesses as well as removal of
unnecessary type conversions. This gives a significant speedup to
all algorithms that use these accessors.
- Significant improvement in compilation time using C++ traits to
prune build options.
- Added support for gauge-field reconstruction to naive staggered
fermions.
- Added hyper-cubic random number generator with multi-GPU support.
- Added topological charge computation.
- Added final computational routines to allow for complete off-load of
MILC staggered RHMC to QUDA (momActionQuda - compute the momentum
contribution to the action, projectSU3Quda - project the gauge field
back onto the SU(3) manifold).
- In the MILC interface staggered solver, the resident gauge field is
reused until it is invalidated by constructing new links (or
overridden with the `num_iters` back door flag).
- Improved gauge field unitarization robustness and added check for
NaN in the results.
- Some cleanup and kernel-fusion optimization of the gauge force and
HISQ force kernels. This also improves compilation time and reduces
library size.
- Added support for imaginary chemical potential to the staggered phase
application / removal kernel, as well as fixing bugs in this
routine.
- Algorithms that previously used double-precision atomics now use a
CUB reduction. This drastically improves the performance of such
routines.
- QUDA can now be configured to enable NVTX markup on the TimeProfile
class and MILC interface to give improved visual profiling.
- All gauge field copies now check for NaN when `HOST_DEBUG=yes` to
improve debugging.
- Set tunecache.tsv to be invalid if git id changes to ensure a valid
tune cache is used.
- Reduced BLAS tuning overhead by setting the maximum grid size to
twice the SM count, avoiding an unnecessarily large parameter sweep.
- Added new profile that records total time spent in QUDA.
- Fixed bugs in long-link field generation.
- Multiple bug fixes to the library. Many of the fixes are listed here:
https://github.com/lattice/quda/pulls?q=is%3Apr+is%3Aclosed+milestone%3A%22QUDA+0.8.0%22
https://github.com/lattice/quda/issues?q=is%3Aissue+milestone%3A%22QUDA+0.8.0%22+is%3Aclosed
Version 0.7.2 - 7th October 2015
- Add support for separate temporal and spatial plaquette computation
- Fixed memory leak in MPI communications
- Fixed issues with assignment of GPUs to processes when using the QMP
backend with multiple nodes with multiple GPUs
- Fixed bug in MR solver which led to incorrect convergence
- Similar to the NVTX markup support for MPI added in 0.7.1, we now
support NVTX markup for calls to the MILC interface. Enabled by using
"--enable-milc-nvtx" when configuring QUDA.
- Multiple bug fixes to the library. Many of the fixes are listed here:
https://github.com/lattice/quda/issues?q=milestone%3A%22QUDA+0.7.2%22+is%3Aclosed
Version 0.7.1 - 11th June 2015
- Added Maxwell-generation GPU support.
- Added automatic support for NVTX markup of MPI calls for visualizing
MPI calls in the visual profiler. Enabled by using
"--enable-mpi-nvtx" when configuring QUDA.
- Modified the clover derivative code to use gauge::FloatNOrder structs,
which in the process adds support for different reconstruct types.
- Added autotuning support to clover derivative and sigma trace
computations.
- Multiple fixes and improvements to the GPU_COMMS feature of QUDA:
fixed a bug when using full-field fermions, improved support on Cray
systems, and added much more robust checking of message memory when
host debugging is enabled.
- The multi-GPU dslash now correctly reports flops and bandwidth when
autotuning.
- Fixed a bug whereby the 5-d domain wall dslash was applied twice
every time it was called.
- Fixed a bug when using both improved staggered fermions and naive
staggered fermions with auto-tuning enabled.
- Fixed a bug when using fused exterior kernels with auto-tuning that
could produce incorrect results.
- To aid debugging, QUDA now prints its version, including a git id
tag, when initialized.
- Drastically improved Doxygen markup of the MILC interface.
- Multiple bug fixes that affect stability and correctness throughout
the library. Many of these fixes are listed here:
https://github.com/lattice/quda/issues?q=milestone%3A%22QUDA+0.7.1%22+is%3Aclosed
Version 0.7.0 - 4th February 2015
- Added support for twisted-clover, 4-d preconditioned domain wall and
4-d preconditioned Möbius fermions.
- Reworked auto-tuning framework to drastically reduce the lookup
overhead of querying the tune cache. This has the effect of
improving the strong scaling (greater than 10% improvement in solver
performance seen at scale).
- Support for GPU-aware MPI and GPUDirect RDMA for faster multi-GPU
communication. This option is enabled using the --enable-gpu-comms
option (GPU_COMMS in make.inc), and requires a GPU-aware MPI stack
(MVAPICH or OpenMPI).
- Reduction in communication latency for half-precision dslash through
merging the main quark and norm fields into a contiguous buffer for
host to device transfers. This reduces API overhead and increases
sustained PCIe bandwidth.
- Added support for double buffering of the MPI receive buffers in the
multi-GPU dslash to allow for early preposting of MPI_Recv.
- Implemented an initial multi-threaded dslash (parallelizing between
MPI and CUDA API calls) to reduce overall CPU frequency sensitivity.
This implementation is embryonic: it simply provides for early
preposting of MPI_Recv and will be extended to parallelize between
MPI_Test and CUDA event querying.
- Added an alternative multi-GPU dslash where the update of the
boundary regions is deployed in a single kernel after all
communication is complete. This reduces kernel launch overhead and
ensures communication is done with maximum priority.
- Reworked multi-GPU dslash interface: there are now different
policies supported for a variety of execution flows. Supported
policies at the moment are QUDA_DSLASH (legacy multi-gpu that
utilizes face buffers for communication buffers), QUDA_DSLASH2 (the
default - regular multi-GPU dslash with CPU-routed communication),
QUDA_FUSED_DSLASH (use a single kernel to update all boundaries
after all communication has finished), QUDA_GPU_COMMS_DSLASH (all
communication emanates directly from GPU memory locations),
QUDA_PTHREADS_DSLASH (multi-threaded dslash). This support is
experimental, and changing the policy type has yet to be exposed
through the interface.
- New routines for construction of the clover matrix field and
inversion of the clover matrices (with optional computation of the
trace log of the clover field). Presently exposed by calling
loadCloverQuda with NULL pointers to the host fields, which forces
construction instead of download of the clover field (see the sketch
below this list).
- Implemented support for exact momentum exponentiation to
complement the pre-existing Taylor expanded variant
(updateGaugeFieldQuda).
- Partial implementation of the clover-field force terms
(clover_deriv_quda.cu and clover_trace_quda.cu).
- All extended gauge field creation routines have been offloaded to
QUDA, minimizing PCIe traffic and CPU time. This has led to a
significant speedup in routines that need this, e.g., the gauge
force.
- Initial support for extended fermion-field creation routines (only
supports staggered fields).
- Fermion field outer product implemented in QUDA. Only exposed for
staggered fermions at present (computeStaggeredOprodQuda).
- EigCG eigenvector deflation algorithm and subsequent initCG
implemented for the preconditioned normal operator. Added a
deflation_test to demonstrate the use of this algorithm.
- Implemented Lanczos eigenvector solver (no unit test yet for
demonstrating this - presently only hooked into the CPS).
- Implemented initial support for communication-avoiding s-step
solvers: CG (QUDA_MPCG_INVERTER) and BiCGstab
(QUDA_MPBICGSTAB_INVERTER). These are only proofs of concept at the
moment and need to be optimized.
- Implemented initial support for overlapping domain-decomposition
preconditioners. Presently only proof of concept and needs further
development.
- Implemented initial support for applying different phases to a gauge
field. Presently only proof of concept and needs further
development. Will be useful for minimizing memory and PCIe traffic
in staggered HMC.
- Implemented support for computation of the gauge field plaquette.
- Implemented initial support for fermion-field contractions.
- Added support for the CGNE solver, to complement the already
existing CGNR.
- Improvements to stability and robustness of the solvers in mixed
precision. QUDA will default to always using a high precision
solution accumulator since this drastically improves convergence,
especially when using half precision.
- Improved the stability and robustness of CG when used in combination
with the Fermilab heavy-quark residual stopping criterion. This has
been validated against the MILC implementation.
- Separated dslash_quda.cu into multiple files to allow for parallel
building to increase compilation speed.
- Added interface support for Luescher's chiral basis for fermion
fields: page 11 of doc/dirac.ps in the DD-HMC code package
http://luscher.web.cern.ch/luscher/DD-HMC. This is selected through
setting QudaInvertParam::gamma_basis = QUDA_CHIRAL_GAMMA_BASIS.
- QUDA will now complain and exit if it detects that a stale tunecache
is being used.
- Removed official support for obsolete compute capabilities 1.1 and
1.2. This makes the minimum supported device compute capability 1.3
(GT200).
- Multiple bug fixes that affect stability and correctness throughout
the library. Many of these fixes are listed here:
https://github.com/lattice/quda/issues?q=milestone%3A%22QUDA+0.7.0+%22+is%3Aclosed.
- Although not strictly related to this release, we have started to
collect common running settings and hints in the QUDA wiki:
https://github.com/lattice/quda/wiki.
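A minimal illustrative sketch of the clover construction path mentioned
above (not from the QUDA sources); it assumes the gauge field has already
been loaded and the clover-related members of QudaInvertParam have been
set as usual:

    #include <stddef.h>
    #include <quda.h>

    /* Passing NULL host pointers asks QUDA to construct the clover field
       (and its inverse) on the device instead of downloading it. */
    void build_clover_on_device(QudaInvertParam *inv_param)
    {
      loadCloverQuda(NULL, NULL, inv_param);
    }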
Version 0.6.1 - 10th March 2014
- All unit tests now enable/disable CPU-side verification with the "--verify
true/false" flag. The default is true.
- The google test API is now used in some of the unit tests
(dslash_test, staggered_dslash_test and blas_test). (Eventually all
unit tests will be built using this.)
- Various bugs have been fixed in fermion_force_test,
hisq_paths_force_test, hisq_unitarize_force_test and
unitarize_link_test.
Version 0.6.0 - 23rd January 2014
- Support for reconstruct 9/13 for the long link in HISQ fermions.
This provides up to a 25% speedup over using no reconstruction.
Owing to architecture constraints, reconstruct 9/13 is not supported
on the "Tesla" (sm_1x) architecture; it requires Fermi, Kepler, or
later GPUs.
- Implemented the long link calculation for HISQ and asqtad fermions.
This has the net result of speeding up the gauge fattening by
about a factor of 1.6.
- Implemented a gauge field update routine that evolves the gauge
field by a given step size using a momentum field. This is exposed
as the function updateGaugeFieldQuda(...).
- Added support for qdpjit field ordering. When used in conjunction
with the device interface, this allows Chroma (when compiled using
qdpjit) to avoid all CPU <-> GPU transfers.
- Completely rewritten gauge and clover field copying routines using a
generic template-driven approach. Due to the large number of possible
input/output combinations, the different interfaces must be opted
into at configure time to keep the compilation time under control
(the MILC and QDP interfaces are enabled by default).
- The QUDA interface (loadGaugeQuda, loadCloverQuda, invertQuda and
invertMultiShiftQuda) now supports device-side pointers as well as
host-side pointers. The location of a given pointer is set by the
QudaFieldLocation members of QudaGaugeParam (location) and
QudaInvertParam (input_location, output_location, clover_location);
see the sketch below this list.
- Added new interface support for QDPJIT ordered fields (dirac, clover
and gauge fields).
- When doing mixed-precision solvers, all low-precision copies of
gauge and clover fields are created from the pre-existing GPU copies
instead of re-copying from the CPU. This lowers the PCIe overhead
by up to 1.75x.
- Significantly improved performance of both degenerate and
non-degenerate twisted-mass CG solver (up to 17% and 32%, respectively).
- ColorSpinorField is now derived from LatticeField, with all
LatticeField derivations now using common page-locked and device memory
buffers. This has the effect of reducing the overall page-locked
memory footprint.
- The source vector is now rescaled so that its norm is equal to unity.
This prevents underflow from occurring when the source vector is very
small.
- Fixed the double-precision definition of the *= vector operator, which
caused a truncation to single precision for certain solver types.
- Fixed memory over-allocation when using clover fermions in half precision.
- Fixed a memory leak for clover fermions.
- Added a workaround to allow QUDA to compile with GCC 4.7.x.
- Many small fixes and overall code cleanup.
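A minimal illustrative sketch of the device-pointer support described
above (not from the QUDA sources); only the location members are shown,
and the rest of the parameter setup is assumed to be done as usual:

    #include <quda.h>

    /* Mark interface pointers as residing in device memory so that the
       interface functions read/write GPU buffers directly. */
    void use_device_pointers(QudaGaugeParam *gauge_param, QudaInvertParam *inv_param)
    {
      gauge_param->location      = QUDA_CUDA_FIELD_LOCATION;
      inv_param->input_location  = QUDA_CUDA_FIELD_LOCATION;
      inv_param->output_location = QUDA_CUDA_FIELD_LOCATION;
      inv_param->clover_location = QUDA_CUDA_FIELD_LOCATION;
    }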
Version 0.5.0 - 20 March 2013
- Added full support for CUDA 5.0, including the Tesla K20 and other
GK110 ("Kepler 2") GPUs. QUDA has yet to be fully optimized for
GK110, however.
- Added multi-GPU support for the domain wall action, to be further
optimized in a future release.
- Added official support for the QDP-JIT library, enabled via the
"--enable-qdp-jit" configure option. With the combination of QUDA
and QDP-JIT, Chroma runs almost entirely on the GPU.
- Added a fortran interface, found in include/quda_fortran.h and
lib/quda_fortran.F90.
- QUDA is now compatible with the Berlin QCD (BQCD) package,
supporting both Wilson and Clover solvers, including support for
multiple GPUs. This currently requires a specific branch of BQCD
(https://github.com/lattice/bqcd-r399-quda).
- Added a new interface function, initCommsGridQuda(), for declaring
the mapping of MPI ranks (or QMP node IDs) to the logical grid used
for communication. This finally completes the MPI interface, which
previously relied on an undocumented function internal to QUDA.
- Added a new interface function, setVerbosityQuda(), to allow for
finer-grained control of status reporting. See the description in
include/quda.h for details.
- Merged wilson_dslash_test and domain_wall_dslash_test together into
a unified dslash_test, and likewise for invert_test. The staggered
tests are still separate for now.
- Moved all internal symbols behind a namespace, "quda", for better
insulation from external applications and libraries.
- Vastly improved the stability and accuracy of the multi-shift CG
solver. The invertMultiShiftQuda() interface function now supports
mixed precision and implements per-shift refinement after the
multi-shift solver completes to ensure accuracy of the final result.
The old invertMultiShiftQudaMixed() interface function has been
removed. In addition, the multi-shift solver now supports setting
the convergence tolerance on a per-pole basis via the tol_offset[]
member of QudaInvertParam.
- Improved the stability and accuracy of mixed-precision CG. As a
result, mixed double/single CG yields a virtually identical iteration
count to pure double CG, and using half precision is now a win.
- Added support for the Fermilab heavy-quark residual as a stopping
condition in BiCGstab, CG, and GCR. To minimize the impact on
performance, the heavy-quark residual is only measured every 10
iterations (for BiCGstab and CG) or only when the solution is computed
(for GCR). This stopping condition has also been incorporated into the
sequential CG refinement stage of the multi-shift solver. The
tolerance for the heavy-quark residual is set via the "tol_hq"
member of QudaInvertParam (and "tol_hq_offset" for the
multi-shift solver). The "residual_type" member selects the
desired stopping condition(s): L2 relative residual, Fermilab
heavy-quark residual, or both. Note that the heavy-quark residual
is not supported on cards with compute capability 1.1, 1.2, or 1.3
(i.e., those predating the "Fermi" architecture) due to hardware
limitations.
- The true residual(s) are now returned in the true_res
and (for multi-shift) true_res_offset members of the QudaInvertParam
struct. When using the heavy-quark residual stopping condition, the
true_res_hq and true_res_hq_offset members are additionally filled
with the heavy-quark residual value(s).
- The BiCGstab solver now supports an initial-guess strategy. This is
presently only supported when employing a one-pass solve and does
not yet work for a two-pass solve (e.g., of the normal equations).
- Double-precision textures are now enabled by default, since the Fermi
double-precision instability has been fixed in the driver accompanying
the CUDA 5.0 production release.
- Fixed a bug related to the sharing of page-locked (pinned) memory
between CUDA and Infiniband that affected correct operation of both
Chroma and MILC on some systems.
- Renamed the "QUDA_NORMEQ_SOLVE" solve_type to "QUDA_NORMOP_SOLVE",
and likewise for "QUDA_NORMOP_PC_SOLVE". This better reflects their
behavior, since a "NORMOP" solve will always involve the normal operator
(A^dag A) but might not correspond to solving the normal equations
of the original system.
- Fixed a long-standing issue so that solve_type and solution_type are
now interpreted as described in the NEWS entry for QUDA 0.3.0 below.
More specifically,
solution_type specifies *what* linear system is to be solved.
solve_type specifies *how* the linear system is to be solved.
We have the following four cases (plus preconditioned variants):
solution_type    solve_type    Effect
-------------    ----------    ------
MAT              DIRECT        Solve Ax=b
MATDAG_MAT       DIRECT        Solve A^dag y = b, followed by Ax=y
MAT              NORMOP        Solve (A^dag A) x = (A^dag b)
MATDAG_MAT       NORMOP        Solve (A^dag A) x = b
An even/odd preconditioned (PC) solution_type generally requires a PC
solve_type and vice versa. As an exception, the un-preconditioned
MAT solution_type may be used with any solve_type, including
DIRECT_PC and NORMOP_PC.
As also noted in the entry for 0.3.0 below, with the CG inverter,
solve_type should generally be set to 'QUDA_NORMOP_PC_SOLVE',
which will solve the even/odd-preconditioned normal equations via
CGNR. (The full solution will be reconstructed if necessary based
on solution_type.) For BiCGstab (with Wilson or Wilson-clover
fermions), 'QUDA_DIRECT_PC_SOLVE' is generally best; a sketch of the
CG configuration follows this list.
- General cleanup and other minor fixes. See
https://github.com/lattice/quda/issues?milestone=7 for a breakdown
of all issues closed in this release.
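A minimal illustrative sketch of the recommended CG configuration
described above (not from the QUDA sources); all other members of
QudaInvertParam are assumed to be set as usual:

    #include <quda.h>

    /* Solve the full system A x = b with CG by running CGNR on the
       even/odd-preconditioned normal equations. */
    void configure_cg_solve(QudaInvertParam *inv_param)
    {
      inv_param->inv_type      = QUDA_CG_INVERTER;
      inv_param->solution_type = QUDA_MAT_SOLUTION;    /* what system is solved */
      inv_param->solve_type    = QUDA_NORMOP_PC_SOLVE; /* how it is solved */
    }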
Version 0.4.0 - 4 April 2012
- CUDA 4.0 or later is now required to build the library.
- The "make.inc.example" template has been replaced by a configure script.
See the README file for build instructions and "configure --help" for
a list of configure options.
- Emulation mode is no longer supported.
- Added support for using multiple GPUs in parallel via MPI or QMP.
This is supported by all solvers for the Wilson, clover-improved
Wilson, twisted mass, and improved staggered fermion actions.
Multi-GPU support for domain wall will be forthcoming in a future
release.
- Reworked auto-tuning so that BLAS kernels are tuned at runtime,
Dirac operators are also tuned, and tuned parameters may be cached
to disk between runs. Tuning is enabled via the "tune" member of
QudaInvertParam and is essential for achieving optimal performance
in the solvers. See the README file for details on enabling
caching, which avoids the overhead of tuning for all but the first
run at a given set of parameters (action, precision, lattice volume,
etc.).
- Added NUMA affinity support. Given a sufficiently recent Linux
kernel and a system with dual I/O hubs (IOHs), QUDA will attempt to
associate each GPU with the "closest" socket. This feature is
disabled by default under OS X and may be disabled under Linux via
the "--disable-numa-affinity" configure flag.
- Improved stability on Fermi-based GeForce cards by disabling double
precision texture reads. These may be re-enabled on Fermi-based
Tesla cards for improved performance, as described in the README
file.
- As of QUDA 0.4.0, support has been dropped for the very first
generation of CUDA-capable devices (implementing "compute
capability" 1.0). These include the Tesla C870, the Quadro FX 5600
and 4600, and the GeForce 8800 GTX.
- Added command-line options for most of the tests. See, e.g.,
"wilson_dslash_test --help"
- Added CPU reference implementations of all BLAS routines, which allows
tests/blas_test to check for correctness.
- Implemented various structural and performance improvements
throughout the library.
- Deprecated the QUDA_VERSION macro (which corresponds to an integer
in octal). Please use QUDA_VERSION_MAJOR, QUDA_VERSION_MINOR, and
QUDA_VERSION_SUBMINOR instead.
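A minimal illustrative sketch of a compile-time version check using the
new macros, in place of the deprecated octal QUDA_VERSION macro:

    #include <quda.h>

    #if (QUDA_VERSION_MAJOR == 0) && (QUDA_VERSION_MINOR < 4)
    #error "QUDA 0.4.0 or later is required"
    #endif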
Version 0.3.2 - 18 January 2011
- Fixed a regression in 0.3.1 that prevented the BiCGStab solver from
working correctly with half precision on Fermi.
Version 0.3.1 - 22 December 2010
- Added support for domain wall fermions. The length of the fifth
dimension and the domain wall height are set via the 'Ls' and 'm5'
members of QudaInvertParam. Note that the convention is to include
the minus sign in m5 (e.g., m5 = -1.8 would be a typical value); see
the sketch below this list.
- Added support for twisted mass fermions. The twisted mass parameter
and flavor are set via the 'mu' and 'twist_flavor' members of
QudaInvertParam. Similar to clover fermions, both symmetric and
asymmetric even/odd preconditioning are supported. The symmetric
case is better optimized and generally also exhibits faster
convergence.
- Improved performance in several of the BLAS routines, particularly
on Fermi.
- Improved performance in the CG solver for Wilson-like (and domain
wall) fermions by avoiding unnecessary allocation and deallocation
of temporaries, at the expense of increased memory usage. This will
be improved in a future release.
- Enabled optional building of Dirac operators, set in make.inc, to
keep build time in check.
- Added declaration for MatDagMatQuda() to the quda.h header file and
removed the non-existent functions MatPCQuda() and
MatPCDagMatPCQuda(). The latter two functions have been absorbed
into MatQuda() and MatDagMatQuda(), respectively, since
preconditioning may be selected via the solution_type member of
QudaInvertParam.
- Fixed a bug in the Wilson and Wilson-clover Dirac operators that
prevented the use of MatPC solution types.
- Fixed a bug in the Wilson and Wilson-clover Dirac operators that
would cause a crash when QUDA_MASS_NORMALIZATION is used.
- Fixed an allocation bug in the Wilson and Wilson-clover
Dirac operators that might have led to undefined behavior for
non-zero padding.
- Fixed a bug in blas_test that might have led to incorrect autotuning
for the copyCuda() routine.
- Various internal changes: removed temporary cudaColorSpinorField
argument to solver functions; modified blas functions to use C++
complex<double> type instead of cuDoubleComplex type; improved code
hygiene by ensuring that all textures are bound in dslash_quda.cu
and unbound after kernel execution; etc.
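A minimal illustrative sketch of the new domain wall parameters
described above (placeholder values, not from the QUDA sources; the
remaining members are assumed to be set as usual):

    #include <quda.h>

    /* Set the fifth-dimension length and domain wall height; note the
       convention of including the minus sign in m5. */
    void set_domain_wall_params(QudaInvertParam *inv_param)
    {
      inv_param->Ls = 16;    /* placeholder fifth-dimension length */
      inv_param->m5 = -1.8;  /* typical domain wall height */
    }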
Version 0.3.0 - 1 October 2010
- CUDA 3.0 or later is now required to build the library.
- Several changes have been made to the interface that require setting
new parameters in QudaInvertParam and QudaGaugeParam. See below for
details.
- The internals of QUDA have been significantly restructured to facilitate
future extensions. This is an ongoing process and will continue
through the next several releases.
- The inverters might require more device memory than they did before.
This will be corrected in a future release.
- The CG inverter now supports improved staggered fermions (asqtad or
HISQ). Code has also been added for asqtad link fattening, the asqtad
fermion force, and the one-loop improved Symanzik gauge force, but
these are not yet exposed through the interface in a consistent way.
- A multi-shift CG solver for improved staggered fermions has been
added, callable via invertMultiShiftQuda(). This function does not
yet support Wilson or Wilson-clover.
- It is no longer possible to mix different precisions for the
spinors, gauge field, and clover term (where applicable). In other
words, it is required that the 'cuda_prec' member of QudaGaugeParam
match both the 'cuda_prec' and 'clover_cuda_prec' members of
QudaInvertParam, and likewise for the "sloppy" variants. This
change has greatly reduced the time and memory required to build the
library.
- Added 'solve_type' to QudaInvertParam. This determines how the linear
system is solved, in contrast to solution_type which determines what
system is being solved. When using the CG inverter, solve_type should
generally be set to 'QUDA_NORMEQ_PC_SOLVE', which will solve the
even/odd-preconditioned normal equations via CGNR. (The full
solution will be reconstructed if necessary based on solution_type.)
For BiCGStab, 'QUDA_DIRECT_PC_SOLVE' is generally best. These choices
correspond to what was done by default in earlier versions of QUDA.
- Added 'dagger' option to QudaInvertParam. If 'dagger' is set to
QUDA_DAG_YES, then the matrices appearing in the chosen solution_type
will be conjugated when determining the system to be solved by
invertQuda() or invertMultiShiftQuda(). This option must also be set
(typically to QUDA_DAG_NO) before calling dslashQuda(), MatPCQuda(),
MatPCDagMatPCQuda(), or MatQuda(); see the sketch below this list.
- Eliminated 'dagger' argument to dslashQuda(), MatPCQuda(), and MatQuda()
in favor of the new 'dagger' member of QudaInvertParam described above.
- Removed the unused blockDim and blockDim_sloppy members from
QudaInvertParam.
- Added 'type' parameter to QudaGaugeParam. For Wilson or Wilson-clover,
this should be set to QUDA_WILSON_LINKS.
- The dslashQuda() function now takes an argument of type
QudaParityType to determine the parity (even or odd) of the output
spinor. This was previously specified by an integer.
- Added support for loading all elements of the gauge field matrices,
without SU(3) reconstruction. Set the 'reconstruct' member of
QudaGaugeParam to 'RECONSTRUCT_NO' to select this option, but note
that it should not be combined with half precision unless the
elements of the gauge matrices are bounded by 1. This restriction
will be removed in a future release.
- Renamed dslash_test to wilson_dslash_test, renamed invert_test to
wilson_invert_test, and added staggered variants of these test
programs.
- Improved performance of the half-precision Wilson Dslash.
- Temporarily removed 3D Wilson Dslash.
- Added an 'OS' option to make.inc.example, to simplify compiling for
Mac OS X.
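A minimal illustrative sketch of the new 'dagger' member in use (not
from the QUDA sources); the parameter struct is assumed to be otherwise
fully initialized:

    #include <quda.h>

    /* The dagger flag must be set before applying the interface
       operators; use QUDA_DAG_YES to apply the conjugated matrix. */
    void apply_mat(void *h_out, void *h_in, QudaInvertParam *inv_param)
    {
      inv_param->dagger = QUDA_DAG_NO;
      MatQuda(h_out, h_in, inv_param);
    }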
Version 0.2.5 - 24 June 2010
- Fixed regression in 0.2.4 that prevented the library from compiling
when GPU_ARCH was set to sm_10, sm_11, or sm_12.
Version 0.2.4 - 22 June 2010
- Added initial support for CUDA 3.x and Fermi (not yet optimized).
- Incorporated look-ahead strategy to increase stability of the BiCGStab
inverter.
- Added definition of QUDA_VERSION to quda.h. This is an integer with
two digits for each of the major, minor, and subminor version
numbers. For example, QUDA_VERSION is 000204 for this release.
Version 0.2.3 - 2 June 2010
- Further improved performance of the blas routines.
- Added 3D Wilson Dslash in anticipation of temporal preconditioning.
Version 0.2.2 - 16 February 2010
- Fixed a bug that prevented reductions (and hence the inverter) from working
correctly in emulation mode.
Version 0.2.1 - 8 February 2010
- Fixed a bug that would sometimes cause the inverter to fail when spinor
padding is enabled.
- Significantly improved performance of the blas routines.
Version 0.2 - 16 December 2009
- Introduced new interface functions newQudaGaugeParam() and
newQudaInvertParam() to allow for enhanced error checking. See
invert_test for an example of their use, and the sketch below this
list.
- Added auto-tuning blas to improve performance (see README for details).
- Improved stability of the half precision 8-parameter SU(3)
reconstruction (with thanks to Guochun Shi).
- Cleaned up the invert_test example to remove unnecessary dependencies.
- Fixed bug affecting saveGaugeQuda() that caused su3_test to fail.
- Tuned parameters to improve performance of the half-precision clover
Dslash on sm_13 hardware.
- Formally adopted the MIT/X11 license.
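A minimal illustrative sketch of the new initializers in use (not from
the QUDA sources; see invert_test for a complete example). The structs
must still be filled in before being passed to loadGaugeQuda() or
invertQuda():

    #include <quda.h>

    void init_params(void)
    {
      /* Per the note above, these initializers enable QUDA's enhanced
         error checking of the parameter structs. */
      QudaGaugeParam  gauge_param = newQudaGaugeParam();
      QudaInvertParam inv_param   = newQudaInvertParam();

      (void)gauge_param;  /* members to be set by the caller */
      (void)inv_param;
    }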
Version 0.1 - 17 November 2009
- Initial public release.