Separate parallelization logic from Dslash classes #97

Open

martin-ueding opened this issue Jul 21, 2017 · 2 comments

@martin-ueding (Contributor)

All four Dslash classes contain very similar parallelization and communication logic: various `#pragma omp` directives and hundreds of lines that do nothing but array and thread index calculations. This is completely independent of the actual physical Dirac operator (which would perhaps be a better name for the Dslash classes). The merge of devel into the hacklatt-strongscale branch showed that identical changes had to be made for Wilson and clover, for both Dslash and achimbdpsi, so this code should live somewhere else.

One of the reasons the hacklatt-strongscale branch was not merged four months ago was supposedly that it does not improve performance in all situations, right? So what we really need is the ability to simply exchange the messaging model, from the old queues to the hacklatt-strongscale model, perhaps by swapping the concrete implementation behind an interface (abstract base class).
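
To make that concrete, here is a minimal sketch of what such an abstract base class could look like. All names are hypothetical; nothing like this exists in the code yet.

```cpp
// Hypothetical interface separating the messaging model from the Dirac
// operator; every name here is made up for illustration.
class FaceComms {
public:
  virtual ~FaceComms() = default;

  // Post receives and start sending the boundary faces for checkerboard cb.
  virtual void startFaceExchange(int cb) = 0;

  // Block until the face for direction dim (forward or backward) has
  // arrived, so the boundary part of the stencil can be applied.
  virtual void finishFace(int dim, bool forward) = 0;
};
```

The Dslash bodies would only ever call through the abstract type, so the old queue-based model and the hacklatt-strongscale model become two concrete subclasses that can be swapped without touching the operator code.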

This refactoring would make it much easier to port the TM Wilson and TM clover operators to the new communication model. Right now the quick fix would be to redo the changes in four more methods:

  1. TM Wilson Dslash
  2. TM Wilson achimbdpsi
  3. TM clover Dslash
  4. TM clover achimbdpsi

Since this is a major change, we should land all other feature branches before we do so to avoid painful merges.

@kostrzewa (Contributor)

As far as I can tell, the strongscale branch is faster in a few situations but slower in many others. Also, at least in my tests, there were lots of deadlocks, so it is certainly not ready to replace the current comms model. Splitting forward and backward face completion is probably a good idea in any case, though.

The other problem is that the Dslash and the communication code get intertwined in complicated ways once you want to ensure full overlap of computation and communication. Requiring both receive queues (which are great on many machines, as far as I can tell) and the ability to have a single thread or a few threads explicitly progress the comms (by spinning on MPI_Wait) makes the abstraction even harder to come up with (not impossible, though).
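
To illustrate the progress-thread part, here is a rough OpenMP/MPI sketch (not QPhiX code; `do_interior` and `do_boundary` are placeholders, and it assumes at least MPI_THREAD_FUNNELED with thread 0 owning the requests):

```cpp
#include <mpi.h>
#include <omp.h>

void do_interior(int tid);  // placeholder: interior stencil for one thread
void do_boundary(int tid);  // placeholder: boundary stencil for one thread

// One thread spins the MPI progress engine while the others overlap the
// interior stencil with the communication.
void dslash_overlapped(MPI_Request *reqs, int nreq) {
  #pragma omp parallel
  {
    const int tid = omp_get_thread_num();
    if (tid == 0) {
      // Dedicated comms thread: spinning in MPI_Testall forces the MPI
      // library to make progress even without asynchronous hardware.
      int done = 0;
      while (!done)
        MPI_Testall(nreq, reqs, &done, MPI_STATUSES_IGNORE);
    } else {
      do_interior(tid);  // interior sites need no remote data
    }
    #pragma omp barrier  // past this point the faces have arrived
    do_boundary(tid);    // every thread can now work on the boundary sites
  }
}
```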

Another difficulty is that performance improvements will almost certainly require moving thread dispatch further up the call hierarchy. This in turn intertwines thread and MPI barriers, which is another aspect that needs to be taken care of.

In my offload_comms branch, I have moved thread dispatch to the operator level (outside of dslash and dslashAChiMinusBDPsi). In one situation I was able to improve performance by more than 30% at the far end of the strong-scaling window on KNL+OPA (Marconi A2). However, in some other situations I get (mild) performance regressions. I also still see unpredictable crashes, probably because I need another MPI barrier in the operator.
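
For what it's worth, the interplay of thread and MPI barriers at the operator level looks roughly like this (a sketch under my assumptions, not the actual offload_comms code; the function names are placeholders):

```cpp
#include <mpi.h>

void dslash_body(int cb);      // placeholder: runs inside the existing team
void achimbdpsi_body(int cb);  // placeholder: runs inside the existing team

// Thread dispatch hoisted to the operator level: one parallel region spans
// both hopping terms, so thread and MPI barriers must be interleaved by hand.
void clover_operator() {
  #pragma omp parallel
  {
    dslash_body(1);               // first term, no internal thread dispatch
    #pragma omp barrier           // all threads finished the first term
    #pragma omp master
    MPI_Barrier(MPI_COMM_WORLD);  // order the two comms phases across ranks
    #pragma omp barrier           // no thread runs ahead of the MPI barrier
    achimbdpsi_body(0);           // second term reuses the same team
  }
}
```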

@bjoo (Contributor)

bjoo commented Nov 2, 2017 via email
