Minor optimization of the LOBPCG solver #1037
Conversation
Of time to do the Rayleigh-Ritz step?
Generally very nice, thanks. In particular mul_hermi is actually quite a bit of code and I wonder if that's really worth it in terms of performance. After all these operations should be way cheaper than a Hamiltonian-vector product, no?
For multiple test systems, I observed a ~2.5% speedup on the overall timings. While this is not huge by itself, multiple such small optimizations across the code may add up to something significant.
@abussy I think for …
I cleaned up this PR to make it as little disruptive as possible, keeping only what I think is important. As @mfherbst suggested, I am also providing an example. Generally, the speed-up will be most visible for systems with a large number of electrons (and/or a high Ecut). For the aluminium supercell below, I gained about 30 s of runtime (on average, out of ~700 s, on my laptop).
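For reference, a benchmark along these lines could look roughly as follows. This is a sketch based on the fcc aluminium cell from DFTK's documentation, not the exact script used for the timings above; Ecut, kgrid, temperature and the cell size are illustrative choices, and function names should be adjusted to your DFTK version.

```julia
# Hedged sketch: fcc aluminium, conventional 4-atom cell (the PR benchmark
# used a larger supercell).  All numerical parameters are illustrative.
using DFTK
using LinearAlgebra

a = 7.65339  # fcc aluminium lattice constant in Bohr
lattice = a * Matrix(I, 3, 3)
Al = ElementPsp(:Al; psp=load_psp("hgh/lda/al-q3"))
atoms     = [Al, Al, Al, Al]
positions = [[0.0, 0.0, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]

model = model_LDA(lattice, atoms, positions; temperature=1e-3)  # metal: needs smearing
basis = PlaneWaveBasis(model; Ecut=30, kgrid=[4, 4, 4])

scfres = self_consistent_field(basis; tol=1e-8)
println(DFTK.timer)  # compare LOBPCG timings/allocations with and without the PR
```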
up to 4% efficiency gains on the SCF.
If this is generally the case, I'd say it's worth it, but on my quick tests I get slightly less optimistic results. Let's discuss this next week.
Implemented the suggested changes. Requires the removal of the …
This version is fine with me.
This is probably ok, but I recall we added it because of the GPU version (where apparently cuBLAS did not check this back then). Maybe you could check whether this is still the case. If cuBLAS now also fails loudly, I have no concerns. @antoine-levitt Ok with this PR as it stands?
Nice PR, some nits but good to go. mul_hermi is a generally useful function we should ideally wrap more generally, but the problem is that this is a blind spot for BLAS (there's no BLAS routine for A*B assuming the result is Hermitian; there is for A^T A however, which is called automatically by Julia), so it's fine to hardcode something in this case.
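To illustrate that point with a small sketch (not DFTK code; mul_hermi_generic is a hypothetical name): for plain dense matrices the best a generic fallback can do is a full product wrapped in Hermitian, whereas A'A already benefits from the dedicated BLAS path.

```julia
using LinearAlgebra

# Hypothetical generic fallback: no BLAS kernel exists for "A*B with Hermitian
# result", so this is a full gemm plus a wrapper that ignores one triangle.
mul_hermi_generic(A, B) = Hermitian(A * B)

X = randn(ComplexF64, 200, 20)

H1 = mul_hermi_generic(X', X)  # generic: full general matrix product
H2 = X' * X                    # A'A: Julia can use the BLAS herk/syrk path here
H1 ≈ H2                        # same result, different kernels underneath
```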
@@ -100,6 +100,28 @@ end

Base.:*(Aadj::Adjoint{T,<:LazyHcat}, B::AbstractMatrix) where {T} = Aadj * LazyHcat(B)

# Special case of Hermitian result: can only actively compute the block upper diagonal
Describe what the function does instead of how it's implemented: something like "Computes A*B, assuming the result is hermitian"
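One possible wording in that spirit (hypothetical, not necessarily the docstring that was merged), attached here to the generic method for brevity:

```julia
using LinearAlgebra

"""
    mul_hermi(A, B)

Compute A * B, assuming the result is Hermitian.
"""
mul_hermi(A, B) = Hermitian(A * B)
```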
@views function mul_hermi(Aadj::Adjoint{T,<:LazyHcat}, B::LazyHcat) where {T} |
Aadj -> A
(mul does AB, not AadjB)
    Hermitian(ret)
end

mul_hermi(Aadj::AbstractArray{T}, B::AbstractArray{T}) where {T} = Hermitian(Aadj * B)
put the generic before the specialization
(also no need to type them as AbstractArray or get T; just do mul_hermi(A, B) = Hermitian(A * B))
    mul!(res, Ablock*B, I, α, β)
end

# Perform a Rayleigh-Ritz for the N first eigenvectors.
@timing function rayleigh_ritz(X, AX, N)
    XAX = X' * AX
    @assert !any(isnan, XAX)
Why? This is useful to debug cases where A*x is badly implemented and runs into NaNs.
This PR was also motivated by monitoring memory allocations in the built-in DFTK.timer, and in particular the memory allocated by the "ortho! X vs Y" step. This orthogonalization function has a loop where potentially large matrix allocations take place at each iteration. Since these matrices keep the same dimensions, it makes more sense to have a single allocation followed by in-place multiplications.
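Schematically, the change amounts to something like the following sketch (placeholder names project_out!, X, Y, niter; this is not the actual ortho! code): the overlap buffer is allocated once outside the loop and the products are done in place with mul!.

```julia
using LinearAlgebra

# Illustrates the allocation pattern only, not DFTK's orthogonalization logic.
function project_out!(Y, X; niter=3)
    XY = similar(Y, size(X, 2), size(Y, 2))  # allocated once, reused each iteration
    for _ in 1:niter
        mul!(XY, X', Y)        # overwrite XY in place instead of allocating X'Y
        mul!(Y, X, XY, -1, 1)  # Y ← Y - X * (X'Y), also in place
    end
    Y
end

X = Matrix(qr(randn(ComplexF64, 500, 20)).Q)  # orthonormal columns
Y = randn(ComplexF64, 500, 10)
project_out!(Y, X)
```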
Additionally, it was noted that quite some time is spent in rayleigh_ritz, in a matrix-matrix multiplication known to result in a Hermitian product. Since the matrices are in the non-standard LazyHcat format, no automatic optimization can take place. This PR adds a new function mul_hermi for LazyHcat matrices, which only computes the upper block diagonal and fills in the rest of the result with its adjoint. This results in savings of the order of 30%.
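A rough standalone sketch of that idea (not the LazyHcat implementation itself; block handling is simplified and all names are placeholders):

```julia
using LinearAlgebra

# For block-column matrices A = [A₁ A₂ …] and B = [B₁ B₂ …] whose product A'B
# is Hermitian (e.g. the same set of blocks on both sides), only the block
# upper triangle needs an explicit product; the remaining blocks are the
# adjoints of already-computed ones.
function mul_hermi_blocks(Ablocks, Bblocks)
    ncols = [size(B, 2) for B in Bblocks]
    offs  = cumsum(vcat(0, ncols))
    T     = promote_type(eltype.(Ablocks)..., eltype.(Bblocks)...)
    ret   = zeros(T, offs[end], offs[end])
    for j in eachindex(Bblocks), i in 1:j            # upper block triangle only
        ret[offs[i]+1:offs[i+1], offs[j]+1:offs[j+1]] = Ablocks[i]' * Bblocks[j]
    end
    Hermitian(ret)   # the lower triangle is implicitly read as the adjoint
end

X = randn(ComplexF64, 300, 8); P = randn(ComplexF64, 300, 8)
mul_hermi_blocks([X, P], [X, P]) ≈ [X P]' * [X P]    # true up to rounding
```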