Extend GPU documentation (#1073)
* Extend GPU documentation

* fix

* GPU doc up
mfherbst authored Mar 4, 2025
1 parent d22f8ae commit 7cce302
Showing 7 changed files with 101 additions and 19 deletions.
2 changes: 2 additions & 0 deletions docs/Project.toml
@@ -1,4 +1,5 @@
[deps]
AMDGPU = "21141c5a-9bdb-4563-92ae-f87d6854732e"
ASEconvert = "3da9722f-58c2-4165-81be-b4d7253e8fd2"
Artifacts = "56f22d72-fd6d-98f1-02f0-08ddc0907c33"
AtomsBase = "a963bdd2-2df7-4f54-a1ee-49d51e6be12a"
@@ -7,6 +8,7 @@
AtomsCalculators = "a3e0e189-c65a-42c1-833c-339540406eb1"
AtomsIO = "1692102d-eeb4-4df9-807b-c9517f998d44"
AtomsIOPython = "9e4c859b-2281-48ef-8059-f50fe53c37b0"
Brillouin = "23470ee3-d0df-4052-8b1a-8cbd6363e7f0"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
DFTK = "acf6eb54-70d9-11e9-0013-234b7a5f5337"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
ForwardDiff = "f6369f11-7733-5829-9624-2563aa707210"
1 change: 1 addition & 0 deletions docs/make.jl
@@ -62,6 +62,7 @@ PAGES = [
# reliability of calculations.
"tricks/achieving_convergence.md",
"tricks/parallelization.md",
"tricks/gpu.jl",
"tricks/scf_checkpoints.jl",
"tricks/compute_clusters.md",
],
1 change: 1 addition & 0 deletions docs/src/features.md
@@ -16,6 +16,7 @@
- Direct minimization, Newton solver
- Multi-level threading (``k``-points, eigenvectors, FFTs, linear algebra)
- MPI-based distributed parallelism (distribution over ``k``-points)
- [Using DFTK on GPUs](@ref): Nvidia *(mostly supported)* and AMD GPUs *(preliminary support)*
- Treat systems of 1000 electrons

* Ground-state properties and post-processing:
14 changes: 14 additions & 0 deletions docs/src/guide/tutorial.jl
@@ -116,3 +116,17 @@ plot_dos(bands; temperature=1e-3, smearing=Smearing.FermiDirac())
# Note that directly employing the `scfres` also works, but the results
# are much cruder:
plot_dos(scfres; temperature=1e-3, smearing=Smearing.FermiDirac())

# !!! info "Where to go from here"
#     - **Background on DFT:**
#       * [Periodic problems](@ref periodic-problems),
#       * [Introduction to density-functional theory](@ref),
#       * [Self-consistent field methods](@ref)
#     - **Running calculations:**
#       * [Temperature and metallic systems](@ref metallic-systems)
#       * [Performing a convergence study](@ref)
#       * [Geometry optimization](@ref)
#     - **Tips and tricks:**
#       * [Using DFTK on compute clusters](@ref),
#       * [Using DFTK on GPUs](@ref),
#       * [Saving SCF results on disk and SCF checkpoints](@ref)
78 changes: 78 additions & 0 deletions docs/src/tricks/gpu.jl
@@ -0,0 +1,78 @@
# # Using DFTK on GPUs
#
# In this example we will look at how DFTK can be used on
# Graphics Processing Units (GPUs).
# In its current state, runs on Nvidia GPUs
# using the [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) Julia
# package are better supported and have considerably fewer rough
# edges.
#
# !!! info "GPU parallelism not supported everywhere"
#     GPU support is still a relatively new feature in DFTK.
#     While basic SCF computations and e.g. forces are supported,
#     this is not yet the case for all parts of the code.
#     In most cases there is no intrinsic limitation, and typically
#     only minor code modifications are needed to make things work on GPUs.
#     If you require GPU support in one of our routines where this is not
#     yet available, feel free to open an issue on GitHub or otherwise get in touch.
#

using AtomsBuilder
using DFTK
using PseudoPotentialData

# **Model setup.** The first step is to set up a [`Model`](@ref) in DFTK.
# This proceeds exactly as in the standard CPU case
# (see also our [Tutorial](@ref)).

silicon = bulk(:Si)

model = model_DFT(silicon;
functionals=PBE(),
pseudopotentials=PseudoFamily("dojo.nc.sr.pbe.v0_4_1.standard.upf"))
nothing # hide

# Next is the selection of the computational architecture.
# This effectively determines whether the computation will be run
# on the CPU or on a GPU.
#
# **Nvidia GPUs.**
# Supported via [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl).
# Right now `Libxc` only supports CUDA 11,
# so we need to explicitly request the 11.8 CUDA runtime:
using CUDA
CUDA.set_runtime_version!(v"11.8") # Note: This requires a restart of Julia
architecture = DFTK.GPU(CuArray)

# **AMD GPUs.** Supported via [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl).
# Here you need to [install ROCm](https://rocm.docs.amd.com/) manually.
# With that in place you can then select:

using AMDGPU
architecture = DFTK.GPU(ROCArray)

# **Portable architecture selection.**
# To make sure this script runs on the GitHub CI (where we don't have GPUs
# available) we check for the availability of GPUs before selecting an
# architecture:

architecture = has_cuda() ? DFTK.GPU(CuArray) : DFTK.CPU()
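The same idea extends to both GPU backends. As a sketch (assuming both CUDA.jl and AMDGPU.jl are installed; `AMDGPU.functional()` is the availability check provided by AMDGPU.jl), one could pick the first available backend and fall back to the CPU:

```julia
using CUDA
using AMDGPU
using DFTK

# Pick the first available GPU backend, falling back to the CPU.
architecture = if has_cuda()
    DFTK.GPU(CuArray)
elseif AMDGPU.functional()
    DFTK.GPU(ROCArray)
else
    DFTK.CPU()
end
```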

# **Basis and SCF.**
# Based on the `architecture` we construct a [`PlaneWaveBasis`](@ref) object
# as usual:

basis = PlaneWaveBasis(model; Ecut=30, kgrid=(5, 5, 5), architecture)
nothing # hide

# ... and run the SCF and some post-processing:

scfres = self_consistent_field(basis; tol=1e-6)
compute_forces(scfres)
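Note that when running on a GPU, arrays stored inside `scfres` (such as the density) live in device memory. A minimal sketch for bringing such data back to the host, e.g. for plotting or custom analysis (assuming the `scfres` from above):

```julia
# Copy the density into a plain CPU Array; on a CPU run this is just a copy.
ρ_host = Array(scfres.ρ)
```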

# !!! warning "GPU performance"
#     Our current (February 2025) benchmarks show DFTK to have reasonable performance
#     on Nvidia / CUDA GPUs, with a 50-fold to 100-fold speed-up over single-threaded
#     CPU execution. However, AMD GPUs have been benchmarked less and
#     there are likely rough edges. Overall this feature is relatively new
#     and we appreciate any experience or bug reports.
10 changes: 5 additions & 5 deletions docs/src/tricks/parallelization.md
@@ -80,7 +80,7 @@ and secondly multiprocessing using MPI
(via the [MPI.jl](https://github.com/JuliaParallel/MPI.jl) Julia interface).
MPI-based parallelism is currently only over ``k``-points,
such that it cannot be used for calculations with only a single ``k``-point.
Otherwise combining both forms of parallelism is possible as well.
There is also support for [Using DFTK on GPUs](@ref).
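Such an MPI-parallel run is typically launched from the shell. A sketch (assuming MPI.jl's `mpiexecjl` wrapper has been installed and a DFTK script named `myscript.jl`, a hypothetical file name):

```shell
# Run the script on 4 MPI processes; k-points are distributed among them.
mpiexecjl -n 4 julia --project myscript.jl
```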

The scaling of both forms of parallelism for a number of test cases
is demonstrated in the following figure.
@@ -126,10 +126,10 @@ DFTK.mpi_master() || (redirect_stdout(); redirect_stderr())
```
at the top of your script to disable printing on all processes but one.

!!! info "MPI-based parallelism not fully supported"
While standard procedures (such as the SCF or band structure calculations)
fully support MPI, not all routines of DFTK are compatible with MPI yet
and will throw an error when being called in an MPI-parallel run.
!!! info "MPI-based parallelism not supported everywhere"
    While most standard procedures are now supported in combination with MPI,
    some functionality is still missing and may error out when called
    in an MPI-parallel run.
    In most cases there is no intrinsic limitation; the feature simply has not
    yet been implemented. If you require MPI support in one of our routines
    where it is not yet available, feel free to open an issue on GitHub
    or otherwise get in touch.
14 changes: 0 additions & 14 deletions examples/cuda.jl

This file was deleted.
