
Minor optimization of the LOBPCG solver #1037

Open. Wants to merge 6 commits into base: master.
Conversation

@abussy (Contributor) commented Dec 18, 2024

This PR was also motivated by monitoring memory allocations with the built-in DFTK.timer, in particular the memory allocated by ortho! X vs Y.

This orthogonalization function has a loop in which potentially large matrix allocations take place at each iteration. Since these matrices keep the same dimensions throughout, it makes more sense to allocate them once and then use in-place multiplications.
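
In essence, the change replaces a pattern like the following (a minimal sketch with illustrative names, not the actual DFTK code):

using LinearAlgebra

# Before: each iteration allocates fresh temporaries of identical size
function ortho_naive!(X, Y, niter)
    for _ in 1:niter
        overlap = Y' * X        # new allocation every iteration
        X .-= Y * overlap       # another temporary for Y * overlap
    end
    X
end

# After: allocate once, then reuse the buffers with in-place mul!
function ortho_inplace!(X, Y, niter)
    overlap = similar(X, size(Y, 2), size(X, 2))  # single allocation
    YO      = similar(X)                          # single allocation
    for _ in 1:niter
        mul!(overlap, Y', X)    # overlap = Y' * X, no allocation
        mul!(YO, Y, overlap)    # YO = Y * overlap, no allocation
        X .-= YO
    end
    X
end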

Additionally, it was noted that quite some time is spent in rayleigh_ritz, in a matrix-matrix multiplication known to result in a Hermitian product. Since the matrices are stored in the non-standard LazyHcat format, no automatic optimization can take place. This PR adds a new function mul_hermi for LazyHcat matrices, which only computes the upper block triangle explicitly and fills in the rest of the matrix with the adjoints of those blocks. This results in savings of the order of 30%.
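
The block structure can be illustrated with a standalone sketch using plain dense blocks in place of DFTK's LazyHcat (names and layout are illustrative, assuming A and B share the same block-column structure):

using LinearAlgebra

# C = A' * B for A = [A1 A2 ...], B = [B1 B2 ...], exploiting that C is
# Hermitian: only blocks (i, j) with i <= j are computed explicitly; the
# blocks below the diagonal are filled in as adjoints of the ones above.
function mul_hermi_blocks(Ablocks, Bblocks)
    ncols = [size(b, 2) for b in Bblocks]
    offs  = cumsum([0; ncols])
    C     = zeros(eltype(first(Ablocks)), offs[end], offs[end])
    for j in eachindex(Bblocks), i in 1:j
        rows = offs[i]+1:offs[i+1]
        cols = offs[j]+1:offs[j+1]
        C[rows, cols] = Ablocks[i]' * Bblocks[j]
        i != j && (C[cols, rows] = C[rows, cols]')  # mirror the off-diagonal block
    end
    Hermitian(C)
end

For k equally sized blocks this computes k(k+1)/2 of the k^2 block products; with three blocks that is 6 of 9, consistent with savings of the order of 30%.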

@mfherbst (Member) commented Dec 19, 2024

> This results in savings of the order of 30%.

Of time to do the Rayleigh-Ritz step?

@mfherbst (Member) left a comment


Generally very nice, thanks.

In particular, mul_hermi is actually quite a bit of code, and I wonder if that's really worth it in terms of performance. After all, these operations should be way cheaper than a Hamiltonian-vector product, no?

(Eight review threads on src/eigen/lobpcg_hyper_impl.jl, resolved)
@abussy (Contributor, Author) commented Dec 19, 2024

> In particular, mul_hermi is actually quite a bit of code, and I wonder if that's really worth it in terms of performance. After all, these operations should be way cheaper than a Hamiltonian-vector product, no?

Overall, for multiple test systems, I observed a ~2.5% speedup in the overall timings. While this is not huge by itself, multiple such small optimizations across the code may add up to something significant.

@mfherbst (Member) commented:
@abussy I think for mul_hermi an example showing a clear difference in timings would be good if you happen to have one handy. Just for @antoine-levitt to get convinced whether the extra code is worth it 😄.

@abussy (Contributor, Author) commented Jan 10, 2025

I cleaned up this PR to make it as minimally disruptive as possible. I only kept what I think is important, namely the mul_hermi() function for products of LazyHcat matrices whose result is known to be Hermitian, and a redefinition of mul!(A, B, C, alpha, beta). In the latter case, there was an explicit matrix-matrix multiplication with the identity (clearly a waste).
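
The identity-multiplication change can be sketched as follows (illustrative dense matrices; the actual code operates on LazyHcat blocks):

using LinearAlgebra

A, B = randn(100, 100), randn(100, 100)
res  = randn(100, 100)
α, β = 2.0, 0.5

# Before: A * B is materialized as a temporary, then multiplied by the
# identity just to apply the α/β scaling of the 5-argument mul!
res1 = copy(res)
mul!(res1, A * B, I, α, β)   # res1 = α * (A*B) * I + β * res1

# After: fuse the product and the scaling into one in-place call
res2 = copy(res)
mul!(res2, A, B, α, β)       # res2 = α * A * B + β * res2

@assert res1 ≈ res2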

For the mul_hermi() function, I observe up to 4% efficiency gains on the SCF. For a single additional function of 20 lines, I think it's a bargain. Making small steps like this can eventually lead to significant gains. That's my opinion at least.

As @mfherbst suggested, I am also providing an example. Generally, the speed-up will be most visible for systems with a large number of electrons (and/or a high Ecut). For the aluminium supercell below, I gained about 30 s of runtime (on average, out of ~700 s, on my laptop).

using DFTK
using PseudoPotentialData
setup_threading()

Ecut = 32.0
kgrid = [1, 1, 1]
maxiter = 10
tol = 1.0e-8

factor = 4
a = 3.8267
lattice = factor * a * [[0.0 1.0 1.0];
                        [1.0 0.0 1.0];
                        [1.0 1.0 0.0]]
Al = ElementPsp(:Al, PseudoFamily("dojo.nc.sr.pbe.v0_4_1.stringent.upf"))
atoms = [Al for i in 1:factor^3]
positions = Vector{Vector{Float64}}([])
for i = 1:factor, j = 1:factor, k = 1:factor
    push!(positions, [i/factor, j/factor, k/factor])
end

model = model_DFT(lattice, atoms, positions; temperature=1e-4,
                  functionals=PBE(), smearing=DFTK.Smearing.Gaussian())

# actual calculations
DFTK.reset_timer!(DFTK.timer)
basis  = PlaneWaveBasis(model; Ecut, kgrid)
scfres = self_consistent_field(basis; maxiter, tol);
@show DFTK.timer

@mfherbst (Member) left a comment


> up to 4% efficiency gains on the SCF.

If this is generally the case, I'd say it's worth it, but in my quick tests I get slightly less optimistic results. Let's discuss this next week.

(Two review threads on src/eigen/lobpcg_hyper_impl.jl, resolved)
@abussy (Contributor, Author) commented Jan 28, 2025

Implemented the suggested changes. This requires removing the @assert !any(isnan, XAX) in rayleigh_ritz() before the diagonalization (because of disallowed scalar access on GPU when XAX is already wrapped as Hermitian). I believe this is safe, as the eigen() function fails loudly with LoadError: ArgumentError: matrix contains Infs or NaNs when NaNs are present.
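
For reference, the failure mode described above can be reproduced in a few lines (a sketch; the exact error text may depend on the Julia version):

using LinearAlgebra

XAX = [NaN 0.0; 0.0 1.0]
eigen(Hermitian(XAX))   # throws ArgumentError: matrix contains Infs or NaNs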

@mfherbst (Member) commented:
This version is fine with me.

> Requires the removal of the @assert !any(isnan, XAX)

This is probably ok, but I recall we added it because of the GPU version (where apparently cuBLAS did not check this back then). Maybe you could check whether this is still the case. If cuBLAS now also fails loudly, I have no concerns.

@antoine-levitt Ok with this PR as it stands?

@antoine-levitt (Member) left a comment


Nice PR, some nits but good to go. mul_hermi is a generally useful function we should ideally wrap more generally, but the problem is that this is a blind spot for BLAS (there is no BLAS routine for A*B assuming the result is Hermitian; there is one for A^T A, however, which Julia calls automatically), so it's fine to hardcode something in this case.
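
The BLAS blind spot can be seen directly in plain Julia (a sketch; dispatch details depend on the Julia and BLAS versions):

using LinearAlgebra

A = randn(ComplexF64, 2000, 200)
B = randn(ComplexF64, 2000, 200)

A' * A   # same matrix on both sides: Julia detects this and can use the
         # rank-k update routine (herk), computing only one triangle
A' * B   # distinct matrices: generic gemm; even if the result happens to
         # be Hermitian, no BLAS routine exploits that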

@@ -100,6 +100,28 @@ end

Base.:*(Aadj::Adjoint{T,<:LazyHcat}, B::AbstractMatrix) where {T} = Aadj * LazyHcat(B)

# Special case of Hermitian result: can only actively compute the block upper diagonal
Review comment (Member):

Describe what the function does instead of how it's implemented: something like "Computes A*B, assuming the result is hermitian"

@views function mul_hermi(Aadj::Adjoint{T,<:LazyHcat}, B::LazyHcat) where {T}
Review comment (Member):

Aadj -> A

Review comment (Member):

(mul does A*B, not Aadj*B)

    Hermitian(ret)
end

mul_hermi(Aadj::AbstractArray{T}, B::AbstractArray{T}) where {T} = Hermitian(Aadj * B)
Review comment (Member):

put the generic before the specialization

Review comment (Member):

(also no need to type them as AbstractArray or to extract T; just do mul_hermi(A, B) = Hermitian(A * B))

    mul!(res, Ablock*B, I, α, β)
end

# Perform a Rayleigh-Ritz for the N first eigenvectors.
@timing function rayleigh_ritz(X, AX, N)
    XAX = X' * AX
    @assert !any(isnan, XAX)
Review comment (Member):

Why? This is useful to debug cases where A*x is badly implemented and runs into NaNs.
