Nonbonded kernel #194

Merged 17 commits into master from nonbonded-kernel on Jan 14, 2025
Conversation

@Peppone98 (Collaborator) commented:

Nonbonded force calculation using GPU block-based algorithm

This PR introduces a GPU-optimized algorithm for calculating nonbonded forces. The algorithm subdivides the system into blocks of 32 particles, evaluates distances between blocks, and computes forces only for interacting blocks within a cutoff radius. The algorithm is used in OpenMM and is described in a paper by Eastman and Pande (https://onlinelibrary.wiley.com/doi/full/10.1002/jcc.21413).

This new implementation significantly accelerates nonbonded force calculations for large systems, as its cost scales linearly with the number of atoms.

Implementation details

  • The algorithm divides the N atoms into bounding boxes of 32 atoms each. A bounding box is the smallest axis-aligned rectangular box that contains its 32 atoms. Identifying these boxes is an O(N) problem.
  • To improve spatial coherence, a simple Morton code-based reordering algorithm is employed so that atoms close in space are also close to each other in the reordered array of indices. This is done by dividing 3D space into small cubic voxels with a side length w slightly smaller than the cutoff radius (a toy sketch of the encoding follows this list). The resulting sequence is used in the functions kernel_min_max!, force_kernel! and energy_kernel.
  • kernel_min_max!: computes the arrays containing the minima and maxima of the bounding boxes
  • force_kernel!: computes the nonbonded forces between particles within interacting bounding boxes
  • energy_kernel: computes the nonbonded potential energy

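A minimal, hypothetical sketch of the Morton (Z-order) encoding, assuming unitless coordinates and a voxel side length w; the actual sorted_Morton_seq implementation in the PR may differ in detail:

```julia
# Interleave the bits of the three voxel indices (10 bits per axis -> 30-bit key).
function morton_code(ix::Integer, iy::Integer, iz::Integer)
    key = zero(UInt32)
    for b in 0:9
        key |= UInt32((ix >> b) & 1) << (3b)
        key |= UInt32((iy >> b) & 1) << (3b + 1)
        key |= UInt32((iz >> b) & 1) << (3b + 2)
    end
    return key
end

# Sorting atom indices by the Morton key of their voxel keeps atoms that are
# close in space close together in the index array.
morton_seq(coords, w) = sortperm([morton_code(floor.(Int, c ./ w)...) for c in coords])

seq = morton_seq([Tuple(rand(3)) for _ in 1:64], 0.1)  # example usage
```
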
Let i and j be the indices of the bounding boxes and let n_blocks be the allocated number of GPU thread blocks, which exactly matches the number of bounding boxes. Each pair of bounding boxes forms a tile, and the 32 threads allocated in each block compute the interactions within each tile. The structure of force_kernel! and energy_kernel is divided into 4 main parts:

  1. j < n_blocks && i < j: computes the forces within 32x32 tiles by shuffling the coordinates, the velocities, the particle indices and the Atom objects
  2. j == n_blocks && i < n_blocks: computes the forces between the r = N % 32 particles in the last box and the particles in the other boxes of 32
  3. i == j && i < n_blocks: computes the interactions within boxes of 32
  4. i == n_blocks && j == n_blocks: computes the interactions within the last box of r particles

In cases 1 and 2, for each tile that has been identified as interacting, the distance of each atom in the first bounding box to the second bounding box is computed; if it is not within the cutoff, a flag is set for that atom so that its computation is skipped.
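
A schematic, sequential view of the four cases (illustration only, assuming N is not a multiple of 32; the real kernels map one GPU thread block to each tile instead of looping):

```julia
# Host-side sketch of the tile decomposition described above.
function visit_tiles(N)
    n_blocks = cld(N, 32)
    for j in 1:n_blocks, i in 1:j    # each pair (i, j) with i <= j is a tile
        if j < n_blocks && i < j
            # case 1: full 32x32 off-diagonal tile between two full boxes
        elseif j == n_blocks && i < n_blocks
            # case 2: last (partial) box of r = N % 32 atoms against a full box
        elseif i == j && i < n_blocks
            # case 3: diagonal tile, interactions within one full box
        else                         # i == j == n_blocks
            # case 4: interactions within the last partial box of r atoms
        end
    end
end
```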

Observations and new neighbor finder (GPUNeighborFinder)

  • The algorithm does not require a neighbor list to be passed as an input.
  • The computation of the maxima and the minima is performed every 10 steps of dynamics by default. The user can modify this value by explicitly setting the field n_steps_reorder of GPUNeighborFinder.
  • force_kernel! and energy_kernel make explicit use of the eligible and special matrices defined in GPUNeighborFinder.

GPUNeighborFinder is a new type of neighbor finder for systems on GPU. For systems with a GPUNeighborFinder, the function find_neighbors returns nothing as the neighbor list. This is because the new calculation does not require prior knowledge of the neighbor list. The new default choices for the setup of the system are displayed below (lines 890-914 of setup.jl):

```julia
if gpu
    neighbor_finder = GPUNeighborFinder(
        eligible=CuArray(eligible),
        dist_cutoff=T(dist_neighbors),
        special=CuArray(special),
        n_steps_reorder=10,
        initialized=false,
    )
elseif use_cell_list
    neighbor_finder = CellListMapNeighborFinder(
        eligible=eligible,
        special=special,
        n_steps=10,
        x0=coords,
        unit_cell=boundary_used,
        dist_cutoff=T(dist_neighbors),
    )
else
    neighbor_finder = DistanceNeighborFinder(
        eligible=eligible,
        special=special,
        n_steps=10,
        dist_cutoff=T(dist_neighbors),
    )
end
```

A new struct is introduced to store the forces matrix, the arrays of minima and maxima, and the sequence of atom indices ordered according to the Morton code.

```julia
struct ForcesBuffer{F, C}
    fs_mat::F
    box_mins::C
    box_maxs::C
    Morton_seq::CuArray{Int32}
end
```

The function forces_nounits! now takes a ForcesBuffer object as an input. While the forces matrix is updated at each step of dynamics, the minima and maxima arrays and the Morton sequence are only updated every n_steps_reorder steps.
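
A minimal sketch of this update schedule; update_boxes! and compute_forces! are hypothetical stand-ins for the reordering/extrema step and the force_kernel! launch, not Molly API:

```julia
# Stand-ins so the sketch runs; the real work happens in Molly's GPU kernels.
update_boxes!(buffer, coords) = nothing    # recompute box extrema and Morton sequence
compute_forces!(buffer, coords) = nothing  # launch force_kernel!, overwriting fs_mat

function run_steps!(buffer, coords, n_steps; n_steps_reorder=10)
    for step_n in 1:n_steps
        if step_n == 1 || step_n % n_steps_reorder == 0
            update_boxes!(buffer, coords)  # only every n_steps_reorder steps
        end
        compute_forces!(buffer, coords)    # every step
    end
end
```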

Some tests for this GPU implementation are added in test/simulation.jl and test/protein.jl.

Performance evaluation

Benchmark of forces(sys) performed on 6mrr_equil.pdb (protein system with 15954 atoms) for the proposed implementation and the previous implementation in Molly:

|  | Float32 | Float64 |
| --- | --- | --- |
| with reordering | 2.724 ms ± 340.828 μs | 8.018 ms ± 325.429 μs |
| without reordering | 1.983 ms ± 26.878 μs | 7.134 ms ± 43.339 μs |
| previous implementation (with neighbor calculation) | 15.084 ms ± 399.510 μs | 20.113 ms ± 435.548 μs |
| previous implementation (without neighbor calculation) | 1.257 ms ± 115.581 μs | 1.585 ms ± 80.793 μs |

The old implementation still outperforms the new one in realistic simulations, where the neighbor list is not computed at every step. However, I believe that small fixes can improve the behaviour of the kernel in long simulations, as the benchmark with reordering seems encouraging.

The command potential_energy(sys) always performs the reordering step in the new implementation, since the potential energy is computed much less frequently than the forces during a simulation.

|  | Float32 | Float64 |
| --- | --- | --- |
| with reordering | 2.028 ms ± 254.345 μs | 7.783 ms ± 371.269 μs |
| previous implementation | 20.036 ms ± 523.120 μs | 24.720 ms ± 488.714 μs |

A possible setup for a benchmark with Float64 precision:

```julia
using Molly, CUDA

n_atoms = 16_000
n_steps = 500
temp = 298.0u"K"
boundary = CubicBoundary(6.0u"nm")
coords = place_atoms(n_atoms, boundary; min_dist=0.1u"nm")
velocities = [random_velocity(10.0u"g/mol", temp) for i in 1:n_atoms]
simulator = VelocityVerlet(dt=0.002u"ps")

s = System(
    atoms=[Atom(mass=10.0u"g/mol", charge=0.0, σ=0.05u"nm", ϵ=0.2u"kJ * mol^-1") for i in 1:n_atoms],
    coords=coords,
    velocities=velocities,
    boundary=boundary,
    pairwise_inters=(LennardJones(cutoff=DistanceCutoff(1.0u"nm"), use_neighbors=true),),
    neighbor_finder=DistanceNeighborFinder(
        eligible=trues(n_atoms, n_atoms),
        n_steps=10,
        dist_cutoff=1.2u"nm",
    ),
    loggers=(coords=CoordinatesLogger(100),),
)

s_gpu = System(
    atoms=CuArray([Atom(mass=10.0u"g/mol", charge=0.0, σ=0.05u"nm", ϵ=0.2u"kJ * mol^-1") for i in 1:n_atoms]),
    coords=CuArray(coords),
    velocities=CuArray(velocities),
    boundary=boundary,
    pairwise_inters=(LennardJones(cutoff=DistanceCutoff(1.0u"nm"), use_neighbors=true),),
    neighbor_finder=GPUNeighborFinder(
        eligible=CuArray(trues(n_atoms, n_atoms)),
        n_steps_reorder=10,
        dist_cutoff=1.0u"nm",
    ),
    loggers=(coords=CoordinatesLogger(100),),
)

# CPU 
simulate!(deepcopy(s), simulator, 20; n_threads=1)
@time simulate!(s, simulator, n_steps; n_threads=1)

# GPU
simulate!(deepcopy(s_gpu), simulator, 20; n_threads=1)
@time simulate!(s_gpu, simulator, n_steps; n_threads=1)
```

@jgreener64 (Collaborator) left a comment:

Great work. I have left most comments inline. I can have a look at the performance too. Also:

  • Does it work for 2D systems? You can extract D as a type parameter from the SVectors and use things like zero(SVector{D, T}) rather than SVector{3, T}(zero(T), zero(T), zero(T)) to make it generic in this sense (see the sketch after this list). You could add a GPU test to the 2D test at the top of test/simulation.jl.
  • Remove the empty file with a strange name.
  • You will need to rebase/merge again from master due to recent changes, sorry about that.
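
A small illustration of the dimension-generic pattern suggested in the first point (init_forces is a hypothetical name, assuming StaticArrays):

```julia
using StaticArrays

# D and T are extracted from the element type, so the same code serves 2D and 3D.
init_forces(coords::AbstractVector{SVector{D, T}}) where {D, T} =
    fill(zero(SVector{D, T}), length(coords))

init_forces([SVector(1.0, 2.0)])       # 2D system
init_forces([SVector(1.0, 2.0, 3.0)])  # 3D system
```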


On the diff (context):

> coords = place_atoms(n_atoms, boundary; min_dist=0.1u"nm")
> if gpu

@jgreener64 (Collaborator) commented:

Might be worth having nl as a Bool argument to the function and testing both with and without the neighbour list.

Resolved review comments on test/protein.jl (2).
On the diff (context):

> s_gpu = System(
@jgreener64 (Collaborator) commented:

These new tests run similar things, so it might just be worth keeping one for speed.

Resolved review comments on src/simulators.jl (1) and src/cuda.jl (5).
@jgreener64 (Collaborator) left a comment:

Great, just minor stuff before it's ready to merge.

Resolved review comment on src/neighbors.jl.
In src/cuda.jl (context):

> all_lines = map(vars) do v
@jgreener64 (Collaborator) commented:

Indentation is now consistent in this file but is 8 spaces; 4 spaces would be consistent with the rest of the code.

In src/cuda.jl (context):

```julia
N = length(sys.coords)
n_blocks = cld(N, WARPSIZE)
r_cut = sys.neighbor_finder.dist_cutoff
if step_n % sys.neighbor_finder.n_steps_reorder == 0 || step_n == 1 || !sys.neighbor_finder.initialized
```
@jgreener64 (Collaborator) commented:

Might be better to remove the step_n == 1 check now to avoid extra computation after step 0.

@Peppone98 (Collaborator, Author) replied:

If a simulation has already been performed on a system, the flag initialized is set to true. This is a problem when starting a new simulation for the same system with Verlet or Langevin, for example, since in those cases the calculation of the forces starts inside the loop over the time steps.

Resolved review comment on test/simulation.jl.
@Peppone98 (Collaborator, Author) commented:

Details of the new implementation

This version introduces several improvements to the computation of forces. The process can be summarized as follows:

  1. Spatial sorting: atom indices are spatially sorted based on the Morton code using the sorted_Morton_seq function.
  2. Box edges computation: the edges of the boxes are computed efficiently with kernel_min_max!.
  3. Boolean matrix compression: the Boolean matrices sys.neighbor_finder.eligible and sys.neighbor_finder.special are compressed into UInt32 representations and ordered via the new compress_boolean_matrices! kernel. This compression eliminates the overhead of reading the Boolean matrices in force_kernel!: the kernel now reads a single UInt32 element instead of 32 Booleans (a toy packing sketch follows the struct below).
  4. Force kernel execution: force_kernel! is invoked for n_steps_reorder consecutive iterations (10 by default), using the same boxes and matrices stored in the ForcesBuffer struct. The number of iterations can be adjusted by setting the n_steps_reorder parameter in GPUNeighborFinder when defining the system. Setting the always_inline flag to true in the kernel call significantly improves performance. Below is the new structure of ForcesBuffer:
```julia
struct ForcesBuffer{F, C, M, R}
    fs_mat::F
    box_mins::C
    box_maxs::C
    Morton_seq::M
    compressed_eligible::R
    compressed_special::R
end
```
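
A toy sketch of the packing idea behind compress_boolean_matrices! (illustrative only: pack32 and is_eligible are hypothetical names, and the real kernel also applies the Morton ordering and runs on the GPU):

```julia
# Pack 32 Boolean flags into one UInt32; bit k-1 holds flags[k].
function pack32(flags::AbstractVector{Bool})
    @assert length(flags) == 32
    word = zero(UInt32)
    for k in 1:32
        word |= UInt32(flags[k]) << (k - 1)
    end
    return word
end

# Reading one flag back is a single shift and mask on one UInt32,
# instead of 32 separate Boolean loads.
is_eligible(word::UInt32, k) = (word >> (k - 1)) & 0x1 == 0x1

word = pack32(trues(32))
@assert is_eligible(word, 17)
```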

Notable changes to force_kernel!

  • Shared memory optimization: The kernel now minimises the use of shared memory, employing it solely for storing forces, while maximising the use of fast thread registers.
  • Bit rotation: elements of the compressed UInt32 matrices are "rotated" during each interaction evaluation, so that the most significant bit wraps around to become the least significant bit. See the example below, with m in 1:warpsize():
```julia
excl = (eligible_bitmask >> (warpsize() - m)) | (eligible_bitmask << m)
```
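
This is a left rotation by m bits. A toy check on an 8-bit analogue (with warpsize() replaced by 8; not the kernel code itself):

```julia
rotl8(x::UInt8, m) = (x >> (8 - m)) | (x << m)  # same rotation pattern, 8-bit toy

# The most significant bit of 0b10000001 wraps around to become the least
# significant bit: the result is 0b00000011.
@assert rotl8(0b10000001, 1) == 0b00000011
```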

Note on potential energy calculation

The calculation of potential energy remains unchanged. Direct access to the Boolean matrices via shared memory continues to be faster than the two-step process of retrieving the compressed matrices and accessing them within the kernel.

Standalone force computation

Benchmark of forces(sys) performed on 6mrr_equil.pdb (protein system with 15954 atoms) for the proposed implementation and the previous implementation in Molly:

|  | Float32 | Float64 |
| --- | --- | --- |
| with reordering | 2.164 ms ± 198.713 μs | 8.109 ms ± 364.257 μs |
| without reordering | 491.569 μs ± 9.123 μs | 5.867 ms ± 78.764 μs |
| previous implementation (with neighbor calculation) | 15.084 ms ± 399.510 μs | 20.113 ms ± 435.548 μs |
| previous implementation (without neighbor calculation) | 1.257 ms ± 115.581 μs | 1.585 ms ± 80.793 μs |

Performance in simulations

Comparison of the simulation performance for 6mrr_equil.pdb and for a system of 16000 LJ particles placed randomly in a cubic box of side 6 nm. The velocity Verlet algorithm was used, and the reported times correspond to 1000 simulation steps:

|  | Float32 | Float64 |
| --- | --- | --- |
| 6mrr_equil.pdb | 0.78 s | 6.41 s |
| Random system | 0.82 s | 4.93 s |
| 6mrr_equil.pdb, previous implementation | 2.82 s | 3.62 s |
| Random system, previous implementation | 2.08 s | 3.19 s |

These tests were conducted on a single NVIDIA GeForce RTX 4090 GPU.
As indicated by the GPU specifications at https://www.techpowerup.com/gpu-specs/geforce-rtx-4090.c3889, the theoretical performance is 82.58 TFLOPS for Float32 operations but only 1,290 GFLOPS for Float64 operations (a 1:64 ratio, since 82.58 / 64 ≈ 1.29 TFLOPS). This disparity reflects the hardware's de-prioritization of Float64 operations, which explains their markedly slower performance.

Validation

All tests in the GROUP=NotGradients collection passed successfully.

Performance in smaller systems

The advantages of the new algorithm are evident even in smaller systems. For example:

```julia
using Molly, CUDA

# Assumed here: data_dir points at Molly's bundled data directory.
data_dir = joinpath(dirname(pathof(Molly)), "..", "data")

n_steps = 1_000
temp = 298.0f0
press = Float32(ustrip(u"u * nm^-1 * ps^-2", 1.0f0u"bar"))
s = System(
    Float32,
    joinpath(data_dir, "5XER", "gmx_coords.gro"),
    joinpath(data_dir, "5XER", "gmx_top_ff.top");
    loggers=(
        temp=TemperatureLogger(Float32, 10),
        coords=CoordinatesLogger(Float32, 10),
        energy=TotalEnergyLogger(Float32, 10),
    ),
    gpu=true,
    units=false,
)
thermostat = AndersenThermostat(temp, 10.0f0)
barostat = MonteCarloBarostat(press, temp, s.boundary; n_steps=20)
simulator = VelocityVerlet(dt=0.002f0, coupling=(thermostat, barostat))

atoms = Array(s.atoms)
s.velocities = CuArray([random_velocity(mass(a), temp) .* 0.01f0 for a in atoms])
simulate!(deepcopy(s), simulator, 100; n_threads=1)
@time simulate!(s, simulator, n_steps; n_threads=1)
```

This simulation of 1000 steps is performed in 0.48 s, while the old implementation takes 1.54 s.

codecov bot commented on Jan 13, 2025:

Codecov Report

Attention: Patch coverage is 3.65854% with 553 lines in your changes missing coverage. Please review.

Project coverage is 67.01%. Comparing base (32f5c73) to head (a753577).
Report is 18 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/cuda.jl | 0.00% | 517 Missing ⚠️ |
| src/force.jl | 9.09% | 20 Missing ⚠️ |
| src/neighbors.jl | 0.00% | 10 Missing ⚠️ |
| src/energy.jl | 0.00% | 3 Missing ⚠️ |
| src/setup.jl | 62.50% | 3 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (32f5c73) and HEAD (a753577): HEAD has 8 fewer uploads than BASE (10 vs 2).
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master     #194      +/-   ##
==========================================
- Coverage   74.20%   67.01%   -7.20%     
==========================================
  Files          35       35              
  Lines        5028     5526     +498     
==========================================
- Hits         3731     3703      -28     
- Misses       1297     1823     +526
```


@jgreener64 (Collaborator) left a comment:

Merge after 2/6 tests pass.

@Peppone98 merged commit 636c9da into master on Jan 14, 2025, with 4 of 10 checks passed, and deleted the nonbonded-kernel branch on January 14, 2025 at 16:48.