30x slowdown in regression #146

Open
conjam opened this issue Dec 6, 2019 · 4 comments

conjam commented Dec 6, 2019

Hey all,

I've found great success using xtensor (and xtensor-blas); during development I've seen ~15x speedups compared to the handwritten code I had before.

Regression testing is another story, though. In jobs that use xtensor-blas, I've seen slowdowns of as much as 30x compared to the original performance. The slowdown is most prominent in smaller unit tests that used to pass in under 500 ms and now take ~17 seconds; larger tests (10+ seconds of runtime) slowed down by 5x-10x.

I suspect the problem lies in OpenBLAS as a backend. I tried limiting the number of threads it spawns by setting OPENBLAS_NUM_THREADS=1, and that did help: before I did that, my system would crash during regression with pthread resource errors.
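(For what it's worth, the same cap can also be set programmatically; a minimal sketch, assuming OpenBLAS is linked directly and exposes its usual C API:)

```cpp
// Sketch: pin OpenBLAS to a single thread from code instead of via the
// environment. openblas_set_num_threads is OpenBLAS's own C entry point.
extern "C" void openblas_set_num_threads(int num_threads);

int main()
{
    openblas_set_num_threads(1);  // same effect as OPENBLAS_NUM_THREADS=1
    // ... run the xtensor-blas workload here ...
}
```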

Before I spend cycles profiling too deeply, I figured I'd ask: has anyone seen anything similar to this?

wolfv (Member) commented Dec 6, 2019

Hi @conjam, first, just in case, have you made sure that you are linking against OpenBLAS or MKL? xtensor-blas contains a C++ implementation (called FLENS) of most BLAS routines, but they are a lot less optimized than actual BLAS.

Also, if you could give us a hint about what exactly you're doing with xtensor / xtensor-blas, we might be able to help better. One problem could be that we sometimes need to convert row-major matrices to column-major for some LAPACK operations, and that conversion can eat performance.

conjam (Author) commented Dec 6, 2019

First off: thanks for the quick response!

I've checked across platforms (I develop on macOS and run regression on CentOS), and libopenblas is linked into both binaries. In case that isn't enough: I have add_definitions(-DHAVE_CBLAS=1) and set(XTENSOR_USE_XSIMD 1) in my CMakeLists.txt (I followed the CMake guide y'all put out verbatim).

Currently, in regression I only use xt::linalg::dot to compute the matrix product of 2-D arrays.
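For reference, the calls look roughly like this (the shapes below are placeholders, not my real sizes):

```cpp
#include <xtensor/xtensor.hpp>
#include <xtensor/xrandom.hpp>
#include <xtensor-blas/xlinalg.hpp>

int main()
{
    // Two 2-D arrays; shapes are made up for illustration.
    xt::xtensor<double, 2> a = xt::random::rand<double>({64, 128});
    xt::xtensor<double, 2> b = xt::random::rand<double>({128, 32});

    // Dispatches to the BLAS gemm when a real BLAS (e.g. OpenBLAS) is linked.
    auto c = xt::linalg::dot(a, b);  // shape (64, 32)
}
```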

wolfv (Member) commented Dec 18, 2019

Hi @conjam,

can you give me some more context on the slowdown, and especially your matrix / vector sizes?
If you have small matrices, it's very possible that hand-written code outperforms BLAS (e.g. for a 3x3 matrix-matrix or matrix-vector product).

You can get some speedup by using xtensor_fixed as a container; however, the BLAS implementation is still "dynamic" and doesn't statically know the size of your matrices.

If you want to achieve the best performance for dot products for small matrices, I would encourage you to write them by hand and use the xtensor_fixed container.
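Roughly along these lines; just a sketch (an untuned 3x3 matrix-vector product), not a definitive implementation:

```cpp
#include <cstddef>
#include <xtensor/xfixed.hpp>

using mat3 = xt::xtensor_fixed<double, xt::xshape<3, 3>>;
using vec3 = xt::xtensor_fixed<double, xt::xshape<3>>;

// Hand-written 3x3 matrix-vector product: for sizes this small, the
// unrolled inner products typically beat a call into a dynamic BLAS.
inline vec3 matvec(const mat3& m, const vec3& v)
{
    vec3 r;
    for (std::size_t i = 0; i < 3; ++i)
    {
        r(i) = m(i, 0) * v(0) + m(i, 1) * v(1) + m(i, 2) * v(2);
    }
    return r;
}
```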

If you have a problem with large matrices, I would appreciate it if you could give me more context so I can check what the problem might be: e.g. the sizes of the matrices, some code snippets, your hand-written implementation, etc.


pdumon commented Mar 13, 2020

xt::linalg::tensordot seems to execute very slowly here; I'm not sure if this is related. However, I found this may be due to the preparatory math and view operations I'm doing: I can influence it by using xt::eval.
Nevertheless, I have two identical algorithms in Python/NumPy and in C++ (using xtensor-blas), and the C++ version is 50-100x slower than the Python/NumPy version. The result of the calculation is identical.
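For context, this is roughly what I mean by using xt::eval (a minimal sketch; the expressions and the contraction axis are made up for illustration):

```cpp
#include <xtensor/xtensor.hpp>
#include <xtensor/xeval.hpp>
#include <xtensor-blas/xlinalg.hpp>

// Materialize lazy expressions into concrete containers before the
// contraction, so the preparatory math/view operations are evaluated
// once up front instead of element by element inside tensordot.
template <class A, class B>
auto contracted(const A& a_expr, const B& b_expr)
{
    auto a = xt::eval(a_expr);
    auto b = xt::eval(b_expr);
    return xt::linalg::tensordot(a, b, 1);  // contract over one axis
}
```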
