
Minor optimization of the LOBPCG solver #1037

Open. Wants to merge 6 commits into base: master.
Conversation

@abussy (Contributor) commented Dec 18, 2024

This PR was also motivated by monitoring memory allocations with the built-in DFTK.timer, in particular the memory allocated by ortho! X vs Y.

This orthogonalization function has a loop in which potentially large matrix allocations take place at each iteration. Since these matrices keep the same dimensions throughout, it makes more sense to allocate them once and then use in-place multiplications.
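
In essence, the change replaces a pattern like the following (a minimal sketch with illustrative names, not the actual DFTK code):

using LinearAlgebra

# Before: each iteration allocates fresh temporaries of identical size
function ortho_naive!(X, Y, niter)
    for _ in 1:niter
        overlap = Y' * X        # new allocation every iteration
        X .-= Y * overlap       # another temporary for Y * overlap
    end
    X
end

# After: allocate once, then reuse the buffers with in-place mul!
function ortho_inplace!(X, Y, niter)
    overlap = similar(X, size(Y, 2), size(X, 2))  # single allocation
    YO      = similar(X)                          # single allocation
    for _ in 1:niter
        mul!(overlap, Y', X)    # overlap = Y' * X, no allocation
        mul!(YO, Y, overlap)    # YO = Y * overlap, no allocation
        X .-= YO
    end
    X
end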

Additionally, it was noted that quite some time is spent in rayleigh_ritz, in a matrix-matrix multiplication known to result in a Hermitian product. Since the matrices are stored in the non-standard LazyHcat format, no automatic optimization can take place. This PR adds a new function mul_hermi for LazyHcat matrices, which only computes the upper block triangle explicitly and fills in the rest of the matrix with the adjoints of those blocks. This results in savings of the order of 30%.
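
The block structure can be illustrated with a standalone sketch using plain dense blocks in place of DFTK's LazyHcat (names and layout are illustrative, assuming A and B share the same block-column structure):

using LinearAlgebra

# C = A' * B for A = [A1 A2 ...], B = [B1 B2 ...], exploiting that C is
# Hermitian: only blocks (i, j) with i <= j are computed explicitly; the
# blocks below the diagonal are filled in as adjoints of the ones above.
function mul_hermi_blocks(Ablocks, Bblocks)
    ncols = [size(b, 2) for b in Bblocks]
    offs  = cumsum([0; ncols])
    C     = zeros(eltype(first(Ablocks)), offs[end], offs[end])
    for j in eachindex(Bblocks), i in 1:j
        rows = offs[i]+1:offs[i+1]
        cols = offs[j]+1:offs[j+1]
        C[rows, cols] = Ablocks[i]' * Bblocks[j]
        i != j && (C[cols, rows] = C[rows, cols]')  # mirror the off-diagonal block
    end
    Hermitian(C)
end

For k equally sized blocks this computes k(k+1)/2 of the k^2 block products; with three blocks that is 6 of 9, consistent with savings of the order of 30%.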

@mfherbst (Member) commented Dec 19, 2024

> This results in savings of the order of 30%.

Of time to do the Rayleigh-Ritz step?

@mfherbst (Member) left a comment


Generally very nice, thanks.

In particular, mul_hermi is actually quite a bit of code, and I wonder if that's really worth it in terms of performance. After all, these operations should be way cheaper than a Hamiltonian-vector product, no?

(Eight review threads on src/eigen/lobpcg_hyper_impl.jl, resolved)
@abussy (Contributor, Author) commented Dec 19, 2024

> In particular, mul_hermi is actually quite a bit of code, and I wonder if that's really worth it in terms of performance. After all, these operations should be way cheaper than a Hamiltonian-vector product, no?

Overall, for multiple test systems, I observed a ~2.5% speedup in the overall timings. While this is not huge by itself, multiple such small optimizations across the code may add up to something significant.

@mfherbst (Member) commented:
@abussy I think for mul_hermi an example showing a clear difference in timings would be good if you happen to have one handy. Just for @antoine-levitt to get convinced whether the extra code is worth it 😄.

@abussy (Contributor, Author) commented Jan 10, 2025

I cleaned up this PR to make it as minimally disruptive as possible. I only kept what I think is important, namely the mul_hermi() function for products of LazyHcat matrices whose result is known to be Hermitian, and a redefinition of mul!(A, B, C, alpha, beta). In the latter case, there was an explicit matrix-matrix multiplication with the identity (clearly a waste).
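
The identity-multiplication change can be sketched as follows (illustrative dense matrices; the actual code operates on LazyHcat blocks):

using LinearAlgebra

A, B = randn(100, 100), randn(100, 100)
res  = randn(100, 100)
α, β = 2.0, 0.5

# Before: A * B is materialized as a temporary, then multiplied by the
# identity just to apply the α/β scaling of the 5-argument mul!
res1 = copy(res)
mul!(res1, A * B, I, α, β)   # res1 = α * (A*B) * I + β * res1

# After: fuse the product and the scaling into one in-place call
res2 = copy(res)
mul!(res2, A, B, α, β)       # res2 = α * A * B + β * res2

@assert res1 ≈ res2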

For the mul_hermi() function, I observe up to 4% efficiency gains on the SCF. For a single additional function of 20 lines, I think it's a bargain. Making small steps like this can eventually lead to significant gains. That's my opinion at least.

As @mfherbst suggested, I am also providing an example. Generally, the speed-up will be most visible for systems with a large number of electrons (and/or a high Ecut). For the aluminium supercell below, I gained about 30 s of runtime (on average, out of ~700 s, on my laptop).

using DFTK
using PseudoPotentialData
setup_threading()

Ecut = 32.0
kgrid = [1, 1, 1]
maxiter = 10
tol = 1.0e-8

factor = 4
a = 3.8267
lattice = factor * a * [[0.0 1.0 1.0];
                        [1.0 0.0 1.0];
                        [1.0 1.0 0.0]]
Al = ElementPsp(:Al, PseudoFamily("dojo.nc.sr.pbe.v0_4_1.stringent.upf"))
atoms = [Al for i in 1:factor^3]
positions = Vector{Vector{Float64}}([])
for i = 1:factor, j = 1:factor, k = 1:factor
    push!(positions, [i/factor, j/factor, k/factor])
end

model = model_DFT(lattice, atoms, positions; temperature=1e-4,
                  functionals=PBE(), smearing=DFTK.Smearing.Gaussian())

# actual calculations
DFTK.reset_timer!(DFTK.timer)
basis  = PlaneWaveBasis(model; Ecut, kgrid)
scfres = self_consistent_field(basis; maxiter, tol);
@show DFTK.timer

@mfherbst (Member) left a comment


> up to 4% efficiency gains on the SCF.

If this is generally the case, I'd say it's worth it, but in my quick tests I get slightly less optimistic results. Let's discuss this next week.

(Two review threads on src/eigen/lobpcg_hyper_impl.jl, resolved)
@abussy (Contributor, Author) commented Jan 28, 2025

Implemented the suggested changes. This requires removing the @assert !any(isnan, XAX) in rayleigh_ritz() before the diagonalization (because of disallowed scalar access on GPU when XAX is already wrapped as Hermitian). I believe this is safe, as the eigen() function fails loudly with LoadError: ArgumentError: matrix contains Infs or NaNs when NaNs are present.
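
For reference, the failure mode described above can be reproduced in a few lines (a sketch; the exact error text may depend on the Julia version):

using LinearAlgebra

XAX = [NaN 0.0; 0.0 1.0]
eigen(Hermitian(XAX))   # throws ArgumentError: matrix contains Infs or NaNs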

@mfherbst (Member) commented:
This version is fine with me.

> Requires the removal of the @assert !any(isnan, XAX)

This is probably ok, but I recall we added it because of the GPU version (where apparently cuBLAS did not check this back then). Maybe you could check whether this is still the case. If cuBLAS now also fails loudly, I have no concerns.

@antoine-levitt Ok with this PR as it stands?

@antoine-levitt (Member) left a comment


Nice PR, some nits but good to go. mul_hermi is a generally useful function we should ideally wrap more generally, but the problem is that this is a blind spot for BLAS (there is no BLAS routine for A*B assuming the result is Hermitian; there is one for A^T A, however, which Julia calls automatically), so it's fine to hardcode something in this case.
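
The BLAS blind spot can be seen directly in plain Julia (a sketch; dispatch details depend on the Julia and BLAS versions):

using LinearAlgebra

A = randn(ComplexF64, 2000, 200)
B = randn(ComplexF64, 2000, 200)

A' * A   # same matrix on both sides: Julia detects this and can use the
         # rank-k update routine (herk), computing only one triangle
A' * B   # distinct matrices: generic gemm; even if the result happens to
         # be Hermitian, no BLAS routine exploits that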

@@ -100,6 +100,28 @@ end

Base.:*(Aadj::Adjoint{T,<:LazyHcat}, B::AbstractMatrix) where {T} = Aadj * LazyHcat(B)

# Special case of Hermitian result: can only actively compute the block upper diagonal
Review comment (Member):

Describe what the function does instead of how it's implemented: something like "Computes A*B, assuming the result is hermitian"

@views function mul_hermi(Aadj::Adjoint{T,<:LazyHcat}, B::LazyHcat) where {T}
Review comment (Member):

Aadj -> A

Review comment (Member):

(mul does A*B, not Aadj*B)

    Hermitian(ret)
end

mul_hermi(Aadj::AbstractArray{T}, B::AbstractArray{T}) where {T} = Hermitian(Aadj * B)
Review comment (Member):

put the generic before the specialization

Review comment (Member):

(also no need to type them as AbstractArray or to extract T; just do mul_hermi(A, B) = Hermitian(A * B))

    mul!(res, Ablock*B, I, α, β)
end

# Perform a Rayleigh-Ritz for the N first eigenvectors.
@timing function rayleigh_ritz(X, AX, N)
    XAX = X' * AX
    @assert !any(isnan, XAX)
Review comment (Member):

Why? This is useful to debug cases where A*x is badly implemented and runs into NaNs.
