
Draft: Catch2 Benchmarking #1723

Draft · wants to merge 1 commit into develop

Conversation

@sliwowitz (Contributor) commented May 12, 2022

This is an example of using Catch2 facilities for benchmarking.

Putting this into Draft mode, since it's still WIP. It compiles and runs, but returns a wrong result, and it probably also measures things we don't really want to measure; I'm putting it out now so others can share their comments.

I had to create another fixture for the benchmarks, based on the earlier KernelExecutionFixture. I thought about inheritance - it didn't work out for me on the first try, but maybe there's a way.

One catch with Catch2 benchmarks is that it internally runs the BENCHMARK-marked code many times, first to estimate the runtime and then to collect enough data for meaningful statistics (Catch2 calls these iterations, and their number can't be changed without modifying the Catch2 sources). This is why my KernelExecutionBenchmarkFixture first sets up the memory (a potentially lengthy operation, depending on what we want to measure in the next step) outside the BENCHMARK area. Inside the BENCHMARK, the memory is cleared/memset/whatever, because that part will be re-run multiple times. After resetting the memory there is a meter.measure([&]{...}); call, which encapsulates the part of the BENCHMARK that is actually to be measured; see the sketch below.
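
A minimal sketch of that structure, assuming Catch2 v3 headers; setupBuffers, resetBuffers and runKernel are hypothetical stand-ins for the real alpaka memory setup and kernel launch:

#include <catch2/benchmark/catch_benchmark.hpp>
#include <catch2/catch_test_macros.hpp>

#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical stand-ins for the alpaka buffer setup and kernel launch.
static std::vector<float> buffer;
static void setupBuffers() { buffer.resize(1'000'000); }
static void resetBuffers() { std::fill(buffer.begin(), buffer.end(), 0.0f); }
static float runKernel() { return std::accumulate(buffer.begin(), buffer.end(), 0.0f); }

TEST_CASE("kernel benchmark", "[benchmark]")
{
    setupBuffers(); // potentially lengthy; runs once, outside the BENCHMARK

    BENCHMARK_ADVANCED("kernel run")(Catch::Benchmark::Chronometer meter)
    {
        resetBuffers(); // re-run for every sample, but not timed
        meter.measure([&] { return runKernel(); }); // only this lambda is measured
    };
}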

You can build the benchmarks with alpaka_BUILD_BENCHMARK=ON. The executable will live in test/benchmark/rand/randBenchmark. If you run it, it will collect 100 samples, that is, it will run each benchmark 100*i times, where i is the number of iterations auto-estimated by Catch2 (it should be somewhere between 1 and 3). If you just want to see whether the benchmarks run at all, you can limit the sampling on the command line: test/benchmark/rand/randBenchmark --benchmark-samples=1 (--benchmark-samples=1 is also set when running in CI).

Known issues:

  • I'm likely mishandling the input/output parameters, so the results (marked debug temp in the output), which should all be around 0.5, actually are not.
  • The fixture is hardcoded to use a single float to communicate any data back to the test's .cpp file.
  • For the benchmark to be meaningful, we should also find a good way to set up the WorkDiv according to the accelerator we're using (see the sketch after this list).
  • CI isn't yet building/running the benchmarks.
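
On the WorkDiv point, a hedged sketch of what an accelerator-aware setup might look like, assuming the alpaka::getValidWorkDiv helper and the ExampleDefaultAcc alias available around the time of this PR; the extent values are placeholders:

#include <alpaka/alpaka.hpp>
#include <alpaka/example/ExampleDefaultAcc.hpp>

#include <cstddef>

using Dim = alpaka::DimInt<1u>;
using Idx = std::size_t;
using Acc = alpaka::ExampleDefaultAcc<Dim, Idx>;

auto makeWorkDiv()
{
    auto const devAcc = alpaka::getDevByIdx<Acc>(0u);
    alpaka::Vec<Dim, Idx> const extent(static_cast<Idx>(1000000));      // placeholder problem size
    alpaka::Vec<Dim, Idx> const elementsPerThread(static_cast<Idx>(1)); // tune per accelerator

    // Let alpaka derive a block/grid split that is valid for the selected accelerator.
    return alpaka::getValidWorkDiv<Acc>(
        devAcc,
        extent,
        elementsPerThread,
        false,
        alpaka::GridBlockExtentSubDivRestrictions::Unrestricted);
}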

Comment on lines +14 to +20
#if defined(ALPAKA_ACC_GPU_CUDA_ENABLED) && !BOOST_LANG_CUDA
# error If ALPAKA_ACC_GPU_CUDA_ENABLED is set, the compiler has to support CUDA!
#endif

#if defined(ALPAKA_ACC_GPU_HIP_ENABLED) && !BOOST_LANG_HIP
# error If ALPAKA_ACC_GPU_HIP_ENABLED is set, the compiler has to support HIP!
#endif
Member:

I dislike those. Can't we just have a prelude in alpaka.hpp after Boost.Predef that checks those in one place?

Contributor:

As long as it takes ALPAKA_HOST_ONLY into account.
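
Something like the following could serve as that prelude; a sketch only, combining the checks from the diff above with the ALPAKA_HOST_ONLY guard requested here:

// Hypothetical prelude in alpaka.hpp, placed after the Boost.Predef include.
#if !defined(ALPAKA_HOST_ONLY)
#    if defined(ALPAKA_ACC_GPU_CUDA_ENABLED) && !BOOST_LANG_CUDA
#        error If ALPAKA_ACC_GPU_CUDA_ENABLED is set, the compiler has to support CUDA!
#    endif
#    if defined(ALPAKA_ACC_GPU_HIP_ENABLED) && !BOOST_LANG_HIP
#        error If ALPAKA_ACC_GPU_HIP_ENABLED is set, the compiler has to support HIP!
#    endif
#endif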


namespace alpaka::test
{
//! The fixture for executing a kernel on a given accelerator.
Member:

Suggested change
//! The fixture for executing a kernel on a given accelerator.
//! The fixture for benchmarking the execution of a kernel on a given accelerator.

@sliwowitz (Contributor, Author) commented May 19, 2022

About the fixture - I don't think we can provide a universal benchmark fixture as we discussed earlier, i.e. one that would execute the kernel and pass along some pre-allocated buffers that were set up in the user's benchmark .cpp code (i.e. in RandBenchmarkKernel).

The issue is two-fold:

  1. We need devAcc and devHost, which are now initialized inside the KernelExecutionBenchmarkFixture, and we'd have to pass these to the user object RandBenchmarkKernel.
    • Of course we can pass them along, but then the KernelExecutionBenchmarkFixture is basically just doing setUp -> measure -> tearDown (see the sketch below).
  2. If we store the buffers inside RandBenchmarkKernel, our KernelExecutionBenchmarkFixture isn't really a fixture, since the data is actually held in RandBenchmarkKernel.
    • It might be just an issue of terminology, but I feel it points to a code smell.
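
To illustrate point 1, a hypothetical shape of such a degenerate fixture (all names invented for illustration): the fixture only owns the devices and frames the measurement, while the user object holds the data:

// Hypothetical illustration only; TDev, TKernel and TMeter are placeholders.
template<typename TDev, typename TKernel>
struct GenericBenchmarkFixture
{
    TDev devAcc;
    TDev devHost;

    template<typename TMeter>
    void run(TKernel& kernel, TMeter& meter)
    {
        kernel.setUp(devAcc, devHost);              // user code allocates its own buffers
        meter.measure([&] { kernel.run(devAcc); }); // fixture merely frames the measurement
        kernel.tearDown();
    }
};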

@j-stephan (Member) commented:

Are you still working on this @sliwowitz?

@sliwowitz (Contributor, Author) commented:

Yes. I got stuck on the KernelExecutionBenchmarkFixture idea. I wanted to make it a general object usable for benchmarks other than the simple example benchmark, but it's still unclear to me how to handle arbitrary inputs/outputs. I'll rebase on develop and take a look at this next week.

@SimeonEhrig (Member) commented:

I checked the output options again. Last time we had the problem that the output was not machine-readable, but I found some documentation about the usage of reporters: https://github.com/catchorg/Catch2/blob/devel/docs/reporters.md

I tested your benchmark with the XML reporter:

$ build/ninja-omp2b-gcc-release/test/benchmark/rand/randBenchmark --reporter XML
<?xml version="1.0" encoding="UTF-8"?>
<Catch2TestRun name="randBenchmark" rng-seed="645286256" xml-format-version="2" catch2-version="3.3.2">
  <TestCase name="defaultRandomGeneratorBenchmark" tags="[randBenchmark]" filename="/home/simeon/projects/alpaka/test/benchmark/rand/src/randBenchmark.cpp" line="53">
    <BenchmarkResults name="Random sequence N=10" samples="100" resamples="100000" iterations="1" clockResolution="32.4883" estimatedDuration="8.6125e+06">
      <!-- All values in nano seconds -->
      <mean value="89822.5" lowerBound="85849.8" upperBound="103189" ci="0.95"/>
      <standardDeviation value="33361.6" lowerBound="10991" upperBound="75389.8" ci="0.95"/>
      <outliers variance="0.98889" lowMild="2" lowSevere="0" highMild="2" highSevere="2"/>
    </BenchmarkResults>
    <BenchmarkResults name="Random sequence N=100000" samples="100" resamples="100000" iterations="1" clockResolution="32.4883" estimatedDuration="7.2092e+06">
      <!-- All values in nano seconds -->
      <mean value="131106" lowerBound="97376" upperBound="287445" ci="0.95"/>
      <standardDeviation value="317164" lowerBound="12744.5" upperBound="753666" ci="0.95"/>
      <outliers variance="0.989974" lowMild="0" lowSevere="0" highMild="0" highSevere="2"/>
    </BenchmarkResults>
    <BenchmarkResults name="Random sequence N=1000000" samples="100" resamples="100000" iterations="1" clockResolution="32.4883" estimatedDuration="1.53628e+07">
      <!-- All values in nano seconds -->
      <mean value="229560" lowerBound="223253" upperBound="240870" ci="0.95"/>
      <standardDeviation value="41958.1" lowerBound="25405.3" upperBound="79203.3" ci="0.95"/>
      <outliers variance="0.935867" lowMild="11" lowSevere="0" highMild="0" highSevere="1"/>
    </BenchmarkResults>
    <BenchmarkResults name="Random sequence N=10000000" samples="100" resamples="100000" iterations="1" clockResolution="32.4883" estimatedDuration="1.02668e+08">
      <!-- All values in nano seconds -->
      <mean value="1.57844e+06" lowerBound="1.32217e+06" upperBound="2.17723e+06" ci="0.95"/>
      <standardDeviation value="1.87999e+06" lowerBound="702312" upperBound="3.27425e+06" ci="0.95"/>
      <outliers variance="0.989892" lowMild="0" lowSevere="0" highMild="1" highSevere="3"/>
    </BenchmarkResults>
    <BenchmarkResults name="Random sequence N=100000000" samples="100" resamples="100000" iterations="1" clockResolution="32.4883" estimatedDuration="1.00224e+09">
      <!-- All values in nano seconds -->
      <mean value="1.02198e+07" lowerBound="1.01508e+07" upperBound="1.03973e+07" ci="0.95"/>
      <standardDeviation value="515800" lowerBound="116904" upperBound="994951" ci="0.95"/>
      <outliers variance="0.484665" lowMild="2" lowSevere="0" highMild="1" highSevere="2"/>
    </BenchmarkResults>
    <BenchmarkResults name="Random sequence N=1000000000" samples="100" resamples="100000" iterations="1" clockResolution="32.4883" estimatedDuration="1.10758e+10">
      <!-- All values in nano seconds -->
      <mean value="1.04739e+08" lowerBound="1.03648e+08" upperBound="1.06501e+08" ci="0.95"/>
      <standardDeviation value="6.91494e+06" lowerBound="4.89068e+06" upperBound="9.9287e+06" ci="0.95"/>
      <outliers variance="0.625317" lowMild="2" lowSevere="0" highMild="0" highSevere="19"/>
    </BenchmarkResults>
    <OverallResult success="true" skips="0">
      <StdOut>
Hardware threads: 64

temp debug normalized result = 18.7131 should probably converge to 0.5.Hardware threads: 64

temp debug normalized result = 18.7981 should probably converge to 0.5.Hardware threads: 64

temp debug normalized result = 9.672 should probably converge to 0.5.Hardware threads: 64

temp debug normalized result = 1.64295 should probably converge to 0.5.Hardware threads: 64

temp debug normalized result = 0.623814 should probably converge to 0.5.Hardware threads: 64

temp debug normalized result = 0.500023 should probably converge to 0.5.
      </StdOut>
    </OverallResult>
  </TestCase>
  <OverallResults successes="6" failures="0" expectedFailures="0" skips="0"/>
  <OverallResultsCases successes="1" failures="0" expectedFailures="0" skips="0"/>
</Catch2TestRun>

There is also a JSON reporter, but for that we need to update Catch2 (only a new minor version): catchorg/Catch2#2706

@sliwowitz (Contributor, Author) commented:

I'd vote for the JSON reporter as it could make the output both machine- and human-readable :-)

@SimeonEhrig (Member) commented:

> I'd vote for the JSON reporter as it could make the output both machine- and human-readable :-)

In general, I also prefer JSON because it is more readable. But we should at least run a short test to check whether XML and JSON provide the same amount of information. For example, the XML output uses comments to store the information that the times are measured in nanoseconds.

@psychocoderHPC (Member) commented:

The JSON reporter is currently not working: it does not contain the benchmark results. The reporter is currently experimental and not fully implemented.
