Nonbonded kernel #194

Merged 17 commits into master from nonbonded-kernel on Jan 14, 2025
Conversation

@Peppone98 (Collaborator) commented:

Nonbonded force calculation using GPU block-based algorithm

This PR introduces a GPU-optimized algorithm for calculating nonbonded forces. The algorithm subdivides the system into blocks of 32 particles, evaluates distances between blocks, and computes forces only for interacting blocks within a cutoff radius. The algorithm is used in OpenMM and is described in a paper by Eastman and Pande (https://onlinelibrary.wiley.com/doi/full/10.1002/jcc.21413).

This new implementation significantly accelerates nonbonded force calculations for large systems, as its cost scales linearly with the number of atoms.

Implementation details

  • The algorithm divides the N atoms into bounding boxes of 32 atoms each. A bounding box is the smallest axis-aligned rectangular box that contains its 32 atoms. Identifying these boxes is an O(N) problem.
  • To improve spatial coherence, a simple Morton code-based reordering algorithm is employed so that atoms close in space are also close to each other in the reordered array of indices. This is done by dividing 3D space into small cubic voxels with a side length w slightly smaller than the cutoff radius (a toy sketch of the encoding follows this list). The resulting sequence is used in the functions kernel_min_max!, force_kernel! and energy_kernel.
  • kernel_min_max!: computes the arrays containing the minima and maxima of the bounding boxes
  • force_kernel!: computes the nonbonded forces between particles within interacting bounding boxes
  • energy_kernel: computes the nonbonded potential energy

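A minimal, hypothetical sketch of the Morton (Z-order) encoding, assuming unitless coordinates and a voxel side length w; the actual sorted_Morton_seq implementation in the PR may differ in detail:

```julia
# Interleave the bits of the three voxel indices (10 bits per axis -> 30-bit key).
function morton_code(ix::Integer, iy::Integer, iz::Integer)
    key = zero(UInt32)
    for b in 0:9
        key |= UInt32((ix >> b) & 1) << (3b)
        key |= UInt32((iy >> b) & 1) << (3b + 1)
        key |= UInt32((iz >> b) & 1) << (3b + 2)
    end
    return key
end

# Sorting atom indices by the Morton key of their voxel keeps atoms that are
# close in space close together in the index array.
morton_seq(coords, w) = sortperm([morton_code(floor.(Int, c ./ w)...) for c in coords])

seq = morton_seq([Tuple(rand(3)) for _ in 1:64], 0.1)  # example usage
```
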
Let i and j be the indices of the bounding boxes and let n_blocks be the allocated number of GPU thread blocks, which exactly matches the number of bounding boxes. Each pair of bounding boxes forms a tile, and the 32 threads allocated in each block compute the interactions within each tile. The structure of force_kernel! and energy_kernel is divided into 4 main parts:

  1. j < n_blocks && i < j: computes the forces within 32x32 tiles by shuffling the coordinates, the velocities, the particle indices and the Atom objects
  2. j == n_blocks && i < n_blocks: computes the forces between the r = N % 32 particles in the last box and the particles in the other boxes of 32
  3. i == j && i < n_blocks: computes the interactions within boxes of 32
  4. i == n_blocks && j == n_blocks: computes the interactions within the last box of r particles

In cases 1 and 2, for each tile that has been identified as interacting, the distance of each atom in the first bounding box to the second bounding box is computed; if it is not within the cutoff, a flag is set for that atom so that its computation is skipped.
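
A schematic, sequential view of the four cases (illustration only, assuming N is not a multiple of 32; the real kernels map one GPU thread block to each tile instead of looping):

```julia
# Host-side sketch of the tile decomposition described above.
function visit_tiles(N)
    n_blocks = cld(N, 32)
    for j in 1:n_blocks, i in 1:j    # each pair (i, j) with i <= j is a tile
        if j < n_blocks && i < j
            # case 1: full 32x32 off-diagonal tile between two full boxes
        elseif j == n_blocks && i < n_blocks
            # case 2: last (partial) box of r = N % 32 atoms against a full box
        elseif i == j && i < n_blocks
            # case 3: diagonal tile, interactions within one full box
        else                         # i == j == n_blocks
            # case 4: interactions within the last partial box of r atoms
        end
    end
end
```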

Observations and new neighbor finder (GPUNeighborFinder)

  • The algorithm does not require a neighbor list to be passed as an input.
  • The computation of the maxima and the minima is performed every 10 steps of dynamics by default. The user can modify this value by explicitly setting the field n_steps_reorder of GPUNeighborFinder.
  • force_kernel! and energy_kernel make explicit use of the eligible and special matrices defined in GPUNeighborFinder.

GPUNeighborFinder is a new type of neighbor finder for systems on GPU. For systems with a GPUNeighborFinder, the function find_neighbors returns nothing as the neighbor list. This is because the new calculation does not require prior knowledge of the neighbor list. The new default choices for the setup of the system are displayed below (lines 890-914 of setup.jl):

```julia
if gpu
    neighbor_finder = GPUNeighborFinder(
        eligible=CuArray(eligible),
        dist_cutoff=T(dist_neighbors),
        special=CuArray(special),
        n_steps_reorder=10,
        initialized=false,
    )
elseif use_cell_list
    neighbor_finder = CellListMapNeighborFinder(
        eligible=eligible,
        special=special,
        n_steps=10,
        x0=coords,
        unit_cell=boundary_used,
        dist_cutoff=T(dist_neighbors),
    )
else
    neighbor_finder = DistanceNeighborFinder(
        eligible=eligible,
        special=special,
        n_steps=10,
        dist_cutoff=T(dist_neighbors),
    )
end
```

A new struct is introduced to store the forces matrix, the arrays of minima and maxima, and the sequence of atom indices ordered according to the Morton code.

```julia
struct ForcesBuffer{F, C}
    fs_mat::F
    box_mins::C
    box_maxs::C
    Morton_seq::CuArray{Int32}
end
```

The function forces_nounits! now takes a ForcesBuffer object as an input. While the forces matrix is updated at each step of dynamics, the minima and maxima arrays and the Morton sequence are only updated every n_steps_reorder steps.
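
A minimal sketch of this update schedule; update_boxes! and compute_forces! are hypothetical stand-ins for the reordering/extrema step and the force_kernel! launch, not Molly API:

```julia
# Stand-ins so the sketch runs; the real work happens in Molly's GPU kernels.
update_boxes!(buffer, coords) = nothing    # recompute box extrema and Morton sequence
compute_forces!(buffer, coords) = nothing  # launch force_kernel!, overwriting fs_mat

function run_steps!(buffer, coords, n_steps; n_steps_reorder=10)
    for step_n in 1:n_steps
        if step_n == 1 || step_n % n_steps_reorder == 0
            update_boxes!(buffer, coords)  # only every n_steps_reorder steps
        end
        compute_forces!(buffer, coords)    # every step
    end
end
```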

Some tests for this GPU implementation are added in test/simulation.jl and test/protein.jl.

Performance evaluation

Benchmark of forces(sys) performed on 6mrr_equil.pdb (protein system with 15954 atoms) for the proposed implementation and the previous implementation in Molly:

|  | Float32 | Float64 |
| --- | --- | --- |
| with reordering | 2.724 ms ± 340.828 μs | 8.018 ms ± 325.429 μs |
| without reordering | 1.983 ms ± 26.878 μs | 7.134 ms ± 43.339 μs |
| previous implementation (with neighbor calculation) | 15.084 ms ± 399.510 μs | 20.113 ms ± 435.548 μs |
| previous implementation (without neighbor calculation) | 1.257 ms ± 115.581 μs | 1.585 ms ± 80.793 μs |

The old implementation still outperforms the new one in realistic simulations, where the neighbor list is not computed at every step. However, I believe that small fixes can improve the behaviour of the kernel in long simulations, as the benchmark with reordering seems encouraging.

The command potential_energy(sys) always performs the reordering step in the new implementation, since the potential energy is computed much less frequently than the forces during a simulation.

|  | Float32 | Float64 |
| --- | --- | --- |
| with reordering | 2.028 ms ± 254.345 μs | 7.783 ms ± 371.269 μs |
| previous implementation | 20.036 ms ± 523.120 μs | 24.720 ms ± 488.714 μs |

A possible setup for a benchmark with Float64 precision:

```julia
using Molly, CUDA

n_atoms = 16_000
n_steps = 500
temp = 298.0u"K"
boundary = CubicBoundary(6.0u"nm")
coords = place_atoms(n_atoms, boundary; min_dist=0.1u"nm")
velocities = [random_velocity(10.0u"g/mol", temp) for i in 1:n_atoms]
simulator = VelocityVerlet(dt=0.002u"ps")

s = System(
    atoms=[Atom(mass=10.0u"g/mol", charge=0.0, σ=0.05u"nm", ϵ=0.2u"kJ * mol^-1") for i in 1:n_atoms],
    coords=coords,
    velocities=velocities,
    boundary=boundary,
    pairwise_inters=(LennardJones(cutoff=DistanceCutoff(1.0u"nm"), use_neighbors=true),),
    neighbor_finder=DistanceNeighborFinder(
        eligible=trues(n_atoms, n_atoms),
        n_steps=10,
        dist_cutoff=1.2u"nm",
    ),
    loggers=(coords=CoordinatesLogger(100),),
)

s_gpu = System(
    atoms=CuArray([Atom(mass=10.0u"g/mol", charge=0.0, σ=0.05u"nm", ϵ=0.2u"kJ * mol^-1") for i in 1:n_atoms]),
    coords=CuArray(coords),
    velocities=CuArray(velocities),
    boundary=boundary,
    pairwise_inters=(LennardJones(cutoff=DistanceCutoff(1.0u"nm"), use_neighbors=true),),
    neighbor_finder=GPUNeighborFinder(
        eligible=CuArray(trues(n_atoms, n_atoms)),
        n_steps_reorder=10,
        dist_cutoff=1.0u"nm",
    ),
    loggers=(coords=CoordinatesLogger(100),),
)

# CPU 
simulate!(deepcopy(s), simulator, 20; n_threads=1)
@time simulate!(s, simulator, n_steps; n_threads=1)

# GPU
simulate!(deepcopy(s_gpu), simulator, 20; n_threads=1)
@time simulate!(s_gpu, simulator, n_steps; n_threads=1)
```

@jgreener64 (Collaborator) left a comment:

Great work. I have left most comments inline. I can have a look at the performance too. Also:

  • Does it work for 2D systems? You can extract D as a type parameter from the SVectors and use things like zero(SVector{D, T}) rather than SVector{3, T}(zero(T), zero(T), zero(T)) to make it generic in this sense (see the sketch after this list). You could add a GPU test to the 2D test at the top of test/simulation.jl.
  • Remove the empty file with a strange name.
  • You will need to rebase/merge again from master due to recent changes, sorry about that.
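
A small illustration of the dimension-generic pattern suggested in the first point (init_forces is a hypothetical name, assuming StaticArrays):

```julia
using StaticArrays

# D and T are extracted from the element type, so the same code serves 2D and 3D.
init_forces(coords::AbstractVector{SVector{D, T}}) where {D, T} =
    fill(zero(SVector{D, T}), length(coords))

init_forces([SVector(1.0, 2.0)])       # 2D system
init_forces([SVector(1.0, 2.0, 3.0)])  # 3D system
```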


On the diff (context):

> coords = place_atoms(n_atoms, boundary; min_dist=0.1u"nm")
> if gpu

@jgreener64 (Collaborator) commented:

Might be worth having nl as a Bool argument to the function and testing both with and without the neighbour list.

Resolved review comments on test/protein.jl (2).
On the diff (context):

> s_gpu = System(
@jgreener64 (Collaborator) commented:

These new tests run similar things, so it might just be worth keeping one for speed.

Resolved review comments on src/simulators.jl (1) and src/cuda.jl (5).
@jgreener64 (Collaborator) left a comment:

Great, just minor stuff before it's ready to merge.

Resolved review comment on src/neighbors.jl.
In src/cuda.jl (context):

> all_lines = map(vars) do v
@jgreener64 (Collaborator) commented:

Indentation is now consistent in this file but is 8 spaces; 4 spaces would be consistent with the rest of the code.

In src/cuda.jl (context):

```julia
N = length(sys.coords)
n_blocks = cld(N, WARPSIZE)
r_cut = sys.neighbor_finder.dist_cutoff
if step_n % sys.neighbor_finder.n_steps_reorder == 0 || step_n == 1 || !sys.neighbor_finder.initialized
```
@jgreener64 (Collaborator) commented:

Might be better to remove the step_n == 1 check now to avoid extra computation after step 0.

@Peppone98 (Collaborator, Author) replied:

If a simulation has already been performed on a system, the flag initialized is set to true. This is a problem when starting a new simulation for the same system with Verlet or Langevin, for example, since in those cases the calculation of the forces starts inside the loop over the time steps.

Resolved review comment on test/simulation.jl.
@Peppone98 (Collaborator, Author) commented:

Details of the new implementation

This version introduces several improvements to the computation of forces. The process can be summarized as follows:

  1. Spatial sorting: atom indices are spatially sorted based on the Morton code using the sorted_Morton_seq function.
  2. Box edges computation: the edges of the boxes are computed efficiently with kernel_min_max!.
  3. Boolean matrix compression: the Boolean matrices sys.neighbor_finder.eligible and sys.neighbor_finder.special are compressed into UInt32 representations and ordered via the new compress_boolean_matrices! kernel. This compression eliminates the overhead of reading the Boolean matrices in force_kernel!: the kernel now reads a single UInt32 element instead of 32 Booleans (a toy packing sketch follows the struct below).
  4. Force kernel execution: force_kernel! is invoked for n_steps_reorder consecutive iterations (10 by default), using the same boxes and matrices stored in the ForcesBuffer struct. The number of iterations can be adjusted by setting the n_steps_reorder parameter in GPUNeighborFinder when defining the system. Setting the always_inline flag to true in the kernel call significantly improves performance. Below is the new structure of ForcesBuffer:
```julia
struct ForcesBuffer{F, C, M, R}
    fs_mat::F
    box_mins::C
    box_maxs::C
    Morton_seq::M
    compressed_eligible::R
    compressed_special::R
end
```
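
A toy sketch of the packing idea behind compress_boolean_matrices! (illustrative only: pack32 and is_eligible are hypothetical names, and the real kernel also applies the Morton ordering and runs on the GPU):

```julia
# Pack 32 Boolean flags into one UInt32; bit k-1 holds flags[k].
function pack32(flags::AbstractVector{Bool})
    @assert length(flags) == 32
    word = zero(UInt32)
    for k in 1:32
        word |= UInt32(flags[k]) << (k - 1)
    end
    return word
end

# Reading one flag back is a single shift and mask on one UInt32,
# instead of 32 separate Boolean loads.
is_eligible(word::UInt32, k) = (word >> (k - 1)) & 0x1 == 0x1

word = pack32(trues(32))
@assert is_eligible(word, 17)
```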

Notable changes to force_kernel!

  • Shared memory optimization: The kernel now minimises the use of shared memory, employing it solely for storing forces, while maximising the use of fast thread registers.
  • Bit rotation: elements of the compressed UInt32 matrices are "rotated" during each interaction evaluation, so that the most significant bit wraps around to become the least significant bit. See the example below, with m in 1:warpsize():
```julia
excl = (eligible_bitmask >> (warpsize() - m)) | (eligible_bitmask << m)
```
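
This is a left rotation by m bits. A toy check on an 8-bit analogue (with warpsize() replaced by 8; not the kernel code itself):

```julia
rotl8(x::UInt8, m) = (x >> (8 - m)) | (x << m)  # same rotation pattern, 8-bit toy

# The most significant bit of 0b10000001 wraps around to become the least
# significant bit: the result is 0b00000011.
@assert rotl8(0b10000001, 1) == 0b00000011
```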

Note on potential energy calculation

The calculation of potential energy remains unchanged. Direct access to the Boolean matrices via shared memory continues to be faster than the two-step process of retrieving the compressed matrices and accessing them within the kernel.

Standalone force computation

Benchmark of forces(sys) performed on 6mrr_equil.pdb (protein system with 15954 atoms) for the proposed implementation and the previous implementation in Molly:

|  | Float32 | Float64 |
| --- | --- | --- |
| with reordering | 2.164 ms ± 198.713 μs | 8.109 ms ± 364.257 μs |
| without reordering | 491.569 μs ± 9.123 μs | 5.867 ms ± 78.764 μs |
| previous implementation (with neighbor calculation) | 15.084 ms ± 399.510 μs | 20.113 ms ± 435.548 μs |
| previous implementation (without neighbor calculation) | 1.257 ms ± 115.581 μs | 1.585 ms ± 80.793 μs |

Performance in simulations

Comparison of the simulation performance for 6mrr_equil.pdb and for a system of 16000 LJ particles placed randomly in a cubic box of side 6 nm. The velocity Verlet algorithm was used, and the reported times correspond to 1000 simulation steps:

|  | Float32 | Float64 |
| --- | --- | --- |
| 6mrr_equil.pdb | 0.78 s | 6.41 s |
| Random system | 0.82 s | 4.93 s |
| 6mrr_equil.pdb, previous implementation | 2.82 s | 3.62 s |
| Random system, previous implementation | 2.08 s | 3.19 s |

These tests were conducted on a single NVIDIA GeForce RTX 4090 GPU.
As indicated by the GPU specifications at https://www.techpowerup.com/gpu-specs/geforce-rtx-4090.c3889, the theoretical performance is 82.58 TFLOPS for Float32 operations but only 1,290 GFLOPS for Float64 operations (a 1:64 ratio, since 82.58 / 64 ≈ 1.29 TFLOPS). This disparity reflects the hardware's de-prioritization of Float64 operations, which explains their markedly slower performance.

Validation

All tests in the GROUP=NotGradients collection passed successfully.

Performance in smaller systems

The advantages of the new algorithm are evident even in smaller systems. For example:

```julia
using Molly, CUDA

# Assumed here: data_dir points at Molly's bundled data directory.
data_dir = joinpath(dirname(pathof(Molly)), "..", "data")

n_steps = 1_000
temp = 298.0f0
press = Float32(ustrip(u"u * nm^-1 * ps^-2", 1.0f0u"bar"))
s = System(
    Float32,
    joinpath(data_dir, "5XER", "gmx_coords.gro"),
    joinpath(data_dir, "5XER", "gmx_top_ff.top");
    loggers=(
        temp=TemperatureLogger(Float32, 10),
        coords=CoordinatesLogger(Float32, 10),
        energy=TotalEnergyLogger(Float32, 10),
    ),
    gpu=true,
    units=false,
)
thermostat = AndersenThermostat(temp, 10.0f0)
barostat = MonteCarloBarostat(press, temp, s.boundary; n_steps=20)
simulator = VelocityVerlet(dt=0.002f0, coupling=(thermostat, barostat))

atoms = Array(s.atoms)
s.velocities = CuArray([random_velocity(mass(a), temp) .* 0.01f0 for a in atoms])
simulate!(deepcopy(s), simulator, 100; n_threads=1)
@time simulate!(s, simulator, n_steps; n_threads=1)
```

This simulation of 1000 steps is performed in 0.48 s, while the old implementation takes 1.54 s.

codecov bot commented on Jan 13, 2025:

Codecov Report

Attention: Patch coverage is 3.65854% with 553 lines in your changes missing coverage. Please review.

Project coverage is 67.01%. Comparing base (32f5c73) to head (a753577).
Report is 18 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/cuda.jl | 0.00% | 517 Missing ⚠️ |
| src/force.jl | 9.09% | 20 Missing ⚠️ |
| src/neighbors.jl | 0.00% | 10 Missing ⚠️ |
| src/energy.jl | 0.00% | 3 Missing ⚠️ |
| src/setup.jl | 62.50% | 3 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (32f5c73) and HEAD (a753577): HEAD has 8 fewer uploads than BASE (10 vs 2).
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master     #194      +/-   ##
==========================================
- Coverage   74.20%   67.01%   -7.20%     
==========================================
  Files          35       35              
  Lines        5028     5526     +498     
==========================================
- Hits         3731     3703      -28     
- Misses       1297     1823     +526
```


@jgreener64 (Collaborator) left a comment:

Merge after 2/6 tests pass.

@Peppone98 merged commit 636c9da into master on Jan 14, 2025, with 4 of 10 checks passed, and deleted the nonbonded-kernel branch on January 14, 2025 at 16:48.