Bartek's BGQ Scalasca Runs


Study of sample files with scalasca on BG/Q

It is difficult to estimate the effect of not OpenMP-parallelizing a given function. This is especially true when trying to compare what happens on Intel machines with at most 4 threads and on BG/Q with 64 threads. For instance, simple functions like deriv_Sb or update_backward_gauge take a clear performance hit from OpenMP on Intel. It is unclear, however, what the situation will be like when one thread suddenly has to compute deriv_Sb for a volume sized for 64 threads! This is even more true for sw_all.

For this reason I will be doing a systematic set of runs of the various hmc sample files with OMP_NUM_THREADS=64. I want to identify further avenues for optimization, especially in light of the amount of work required to get OpenMP parallelization working for functions with reductions and for those where threads could potentially write into the same memory location.

In each of these runs the global volume is 48x24^3, bg_size=32, and 4 trajectories are computed.

HMC3, 32_64

deriv_Sb

In this run it is striking that more than 20 percent of the total wallclock time is spent in deriv_Sb. This is because deriv_Sb currently has no OpenMP parallelization, as the threads would write competitively into the same memory locations. In addition, more than 6% of thread idling is due to deriv_Sb, and another 22% of thread idling also comes from deriv_Sb, but in a different location.
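
To make the problem concrete, here is a minimal sketch of the kind of access pattern that races. The names, layout and neighbour indexing are purely illustrative, not the actual tmLQCD data structures: the point is only that each site accumulates into the derivative entries of its neighbours, so two threads can hit the same element at the same time.

#define VOLUME 4096

static double df[VOLUME][4]; /* illustrative stand-in for the derivative field */

void deriv_sketch(const double (*contrib)[4]) {
  #pragma omp parallel for /* UNSAFE as written: neighbouring iterations race */
  for (int x = 0; x < VOLUME; x++) {
    for (int mu = 0; mu < 4; mu++) {
      /* both the forward and the backward neighbour of a site y write to */
      /* df[y][mu], and those iterations can sit on different threads     */
      df[(x + 1) % VOLUME][mu]          += contrib[x][mu];
      df[(x - 1 + VOLUME) % VOLUME][mu] -= contrib[x][mu];
    }
  }
}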

update_gauge

More than 10% of the time is spent in update_gauge because there is no OpenMP there, and 12% of thread idling is localized here! Between the two of them, update_gauge and deriv_Sb account for over 40% of thread idling!

deriv_Sb threaded

Implementing _trace_lambda_mul_add_assign with "tm_atomic" pragmas gives a factor 15 improvement for deriv_Sb and a total runtime saving of about 20 percent (from 17 down to 14 minutes)!
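
Continuing the illustrative sketch from above: protecting each scalar accumulation with an atomic update removes the race, assuming tm_atomic expands to an OpenMP atomic pragma when OpenMP is enabled (the real macro works on the components of an su3adj field, not on this toy layout).

void deriv_sketch_atomic(const double (*contrib)[4]) {
  #pragma omp parallel for
  for (int x = 0; x < VOLUME; x++) {
    for (int mu = 0; mu < 4; mu++) {
      /* one atomic per scalar accumulation serializes only the */
      /* conflicting updates, not the whole loop                 */
      #pragma omp atomic
      df[(x + 1) % VOLUME][mu] += contrib[x][mu];
      #pragma omp atomic
      df[(x - 1 + VOLUME) % VOLUME][mu] -= contrib[x][mu];
    }
  }
}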

square_norm

The non-OpenMP routines square_norm, scalar_prod, assign_add_mul_r and assign_mul_add_r together account for a little more than 10% of the total time. These need to be parallelized, but with lower priority than update_gauge, say.

square_norm threaded

The improvement for square_norm is excellent with a speedup of close to 40 and almost no management or idling overhead! Total runtime decreased to 11 minutes.
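
These linear algebra routines are straightforward reductions. A minimal sketch of the pattern, assuming a flat array of doubles as a stand-in for the spinor fields of the real routine:

double square_norm_sketch(const double *s, int n) {
  double res = 0.0;
  /* the reduction clause gives each thread a private accumulator */
  /* and combines them at the end of the loop                     */
  #pragma omp parallel for reduction(+ : res)
  for (int i = 0; i < n; i++)
    res += s[i] * s[i];
  return res;
}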

scalar_prod threaded

The improvement here is similar to that above.
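
The only wrinkle for scalar_prod is that the result is complex, and OpenMP reduction clauses of this vintage do not handle complex types portably. A sketch that reduces the real and imaginary parts separately, again with flat arrays standing in for the spinor fields:

#include <complex.h>

double _Complex scalar_prod_sketch(const double _Complex *s,
                                   const double _Complex *r, int n) {
  double re = 0.0, im = 0.0;
  #pragma omp parallel for reduction(+ : re, im)
  for (int i = 0; i < n; i++) {
    double _Complex c = conj(s[i]) * r[i];
    re += creal(c);
    im += cimag(c);
  }
  return re + im * I;
}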

update_gauge threaded

The next target of opportunity is update_gauge, which now takes up almost 15 percent of the total wallclock time. The problem is that we hit a bug [1], more or less confirmed by Dirk Pleiter, whereby this kind of local multithreaded update seemed to fail for some reason... and it still fails!

[1] https://github.com/etmc/tmLQCD/issues/121

Adding OpenMP parallelism to update_gauge gives a factor 25 speedup... but this is a roadblock because it does not produce correct results. I have tried adding ALIGN qualifiers, to no avail. The next step will be to take the function apart bit by bit to see what exactly is causing the problem!
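
What makes the failure puzzling is that, structurally, the threaded update is just a work-sharing loop over links, with each link written by exactly one thread. A sketch with hypothetical types and a hypothetical update_link helper standing in for the exponentiation and multiplication done by the real code (VOLUME reused from the sketch above):

typedef struct { double c[18]; } su3;    /* placeholder for a 3x3 complex matrix   */
typedef struct { double d[8]; } su3adj;  /* placeholder for an su3 algebra element */

static su3    U[VOLUME][4]; /* gauge links       */
static su3adj P[VOLUME][4]; /* conjugate momenta */

void update_link(su3 *u, const su3adj *p, double step); /* hypothetical helper */

void update_gauge_sketch(double step) {
  /* each (x, mu) link is written by exactly one thread, so this */
  /* should be race-free as a plain work-sharing loop            */
  #pragma omp parallel for
  for (int x = 0; x < VOLUME; x++)
    for (int mu = 0; mu < 4; mu++)
      update_link(&U[x][mu], &P[x][mu], step);
}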

HMC3, 512_4

After a great number of tests it seems that, between 32_64 and 512_4, 512_4 is the clear performance winner for HMC3. It is of course possible that 128_16 or 256_8 could perform better, but this has not yet been tested!

This scalasca run, with all the important functions OpenMP-parallelized except for update_gauge, shows that in this configuration thread management seems to be a non-issue. Thread idling occurs only during MPI waits and in the non-parallelized update_gauge, adding up to only 10 percent of the total time. Thread management and MPI management each account for only about 5% of the total time! (By contrast, with 64 threads, thread management takes up close to 40% of the total time, leading to severe thread idling!)

We are very close to being as good as we can be given the constraints of our software.

benchmark, 512_4 / 32_64

In 512_4 thread idling is almost a non-issue. Strangely enough, though, it does affect the "nocomm" hopping matrix. This could be a side-effect of the "omp single" section, which in the nocomm version of the hopping matrix serves only as a barrier, with no work done inside. By contrast, in the version with communication, MPI will be operating in that section. This particular measurement should thus be taken with a grain of salt.
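
For reference, an "omp single" construct carries an implicit barrier at its end, so even an empty single region synchronizes all threads. A sketch of the structure, with the actual stencil work elided into comments:

void hopping_matrix_sketch(void) {
  #pragma omp parallel
  {
    /* ... first half of the local stencil work ... */

    #pragma omp single
    {
      /* with communication: the MPI exchange runs here on one thread; */
      /* in the nocomm version this block is empty                     */
    }
    /* implicit barrier at the end of the single construct: every      */
    /* thread waits here, and scalasca books that time as idling       */

    /* ... second half of the local stencil work ... */
  }
}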

In the 512_4 version OpenMP management overhead is in the 4% region, while MPI accounts for about 6%. In contrast, in the 32_64 version OpenMP management accounts for about 20% of total time! MPI, on the other hand, costs less than one percent of total time here.

It must be noted that the scalasca overhead affects the readings here, because without scalasca the 32_64 benchmark is substantially faster than the 512_4 one (401 vs 360 Mflops per thread).

HMC-TMCLOVERDET, 32_64

This sample traditionally has a very strong contribution from sw_all. Let's see whether we get errors from the potential write conflicts with 64 threads, and whether the tm_atomic implementation fixes them and is sufficiently fast.

Building and running with scalasca on BG/Q

To compile an instrumented executable:

module load UNITE scalasca
cd builddir
export SKIN_MODE=none  # otherwise the compiler doesn't work during configure
../configure [...] CC="skin $XLCDIR/mpixlc_r"
unset SKIN_MODE        # when compiling we want skinning to work, of course
make

To run an instrumented executable and generate an epik experiment:

# @ job_name         = BGQ_hmc3_hybrid_32_64_hs_scalasca
# @ error            = $(job_name).$(jobid).out
# @ output           = $(job_name).$(jobid).out
# @ environment      = COPY_ALL;
# @ wall_clock_limit = 00:15:00
# @ notification     = always
# @ notify_user      = [email protected]
# @ job_type         = bluegene
# @ bg_connectivity  = TORUS
# @ bg_size          = 32
# @ queue

module load UNITE scalasca

export NAME=BGQ_hmc3_hybrid_32_64_hs_scalasca
export OMP_NUM_THREADS=64
export ESD_BUFFER_SIZE=10000000 # this overflows with many MPI processes, so set to large number


if [[ ! -d ${WORK}/${NAME} ]]
then
  mkdir -p ${WORK}/${NAME}
fi

cd ${WORK}/${NAME}

cp /homea/hch02/hch028/code/tmLQCD.urbach/build_bgq_hybrid/hmc3_32_64.input ${WORK}/${NAME}/hmc.input

scan runjob --np 32 "--ranks-per-node 1" "--cwd ${WORK}/${NAME}" -- ${HOME}/code/tmLQCD.urbach/build_bgq_hybrid_hs_scalasca/hmc_tm

To analyze an epik experiment:

module load UNITE scalasca
cd $WORK/$NAME
square epik*

16 thread runs on Intel in Zeuthen

Since the BG/Q is so full, I'm approximating the effect on BG/Q by using the Zeuthen wgs and running a pure OpenMP version of the code. Early in the morning and after work hours one can get good measurements here. The machine supports 16 concurrent threads, so that's nice.

Given the modest cache size of 8MB per CPU, I will be using a local volume of only 8^4. The raw benchmark result is about 17500 Mflops for the halfspinor version. The CG for HMC3 gives around 11500 Mflops.

no optimization

With the local volume being smaller and the number of threads lower, the effects of deriv_Sb and update_gauge are not nearly as pronounced as on BG/Q. update_gauge and deriv_Sb contribute about 10 and 6 percent, respectively, to total time spent and to thread idling.

There are contributions of about 3-4 percent each from square_norm and scalar_prod.

update_gauge threaded

The effect here is very good, with a speedup of about 14 in the total time spent in update_gauge.

deriv_Sb threaded

OpenMP in deriv_Sb brings a factor 10 improvement. Clearly, the atomic statements in _trace_lambda_mul_add_assign have some overhead, but it is manageable!