| 1 | +% Performance counters analysis for Hyper-Threading |
| 2 | +% Beatrice Bevilacqua, Anxhelo Xhebraj |
| 3 | +% March 2019 |
| 4 | + |
| 5 | +Performance counters analysis for Hyper-Threading |
| 6 | +================================================= |
| 7 | + |
| 8 | + |
| 9 | +Performance Counters Frameworks |
| 10 | +------------------------------- |
| 11 | +The complexity of newer architectures makes a good understanding of
| 12 | +the underlying hardware necessary to reach peak performance.
| 13 | +Following this trend, new interfaces such as Performance Monitoring
| 14 | +Units (PMUs) have been made available to developers for spotting
| 15 | +performance bottlenecks in their applications.
| 16 | + |
| 17 | +PMUs enable developers to observe and count CPU events such as
| 18 | +branch mispredictions, cache misses and other finer-grained details
| 19 | +across the whole pipeline. Although powerful, this information remains
| 20 | +burdensome to work with given the diversity of the events, which makes
| 21 | +it difficult to identify actual optimization opportunities.
| 22 | +Depending on the processor family, about 4 counters on average can be
| 23 | +read simultaneously using Model Specific Registers (MSRs). In order
| 24 | +to read more than 4 events, various tools multiplex these registers
| 25 | +in a *time-sharing* fashion.
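
The time-sharing idea can be illustrated with a small sketch. When an event occupies a hardware counter for only part of the measurement window, a common convention (used, for example, by Linux perf, which reports both the time an event was enabled and the time it was actually running) is to extrapolate the raw count by the ratio of the two. The function and parameter names below are illustrative and not part of any tool's API.

```python
def scale_multiplexed_count(raw_count, time_enabled_ns, time_running_ns):
    """Estimate the full-window count of an event that was only counted
    for part of the window because its counter was time-shared."""
    if time_running_ns <= 0:
        raise ValueError("event was never scheduled on a hardware counter")
    if time_running_ns > time_enabled_ns:
        raise ValueError("running time cannot exceed enabled time")
    return raw_count * time_enabled_ns / time_running_ns


if __name__ == "__main__":
    # Hypothetical numbers: the event was live for 25% of a 1 s window.
    estimate = scale_multiplexed_count(1_000_000, 1_000_000_000, 250_000_000)
    print(f"estimated events over the whole window: {estimate:.0f}")
```

Such an estimate is only meaningful when the workload behaves roughly uniformly over the window, since phases that occur while the event is not scheduled are never observed.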
| 26 | + |
| 27 | +Many PMU-based tools for performance analysis have been developed,
| 28 | +ranging from *raw* event counts to more sophisticated, aggregated
| 29 | +measures. Some of them are:
| 30 | + |
| 31 | +* `msr`: direct access to the device files `/dev/cpu/*/msr` |
| 32 | +* [PAPI] : A Performance Application Programming Interface that |
| 33 | + offers a set of APIs for using performance counters. |
| 34 | + Supports multiple architectures and multiplexing. |
| 35 | +* [likwid] : A suite of applications and libraries for analysing
| 36 | + High Performance Computing applications. It
| 37 | + contains out-of-the-box utilities to work with MPI,
| 38 | + power profiling and architecture topology.
| 39 | +* [Intel VTune Amplifier] : Application for performance analysis on
| 40 | + Intel architectures. Gives insights into possible bottlenecks
| 41 | + of the application by annotating its source code, and suggests
| 42 | + possible solutions.
| 43 | +* [perf] : Like Intel VTune Amplifier, shows which
| 44 | + functions are most critical to the application. Additionally
| 45 | + provides higher-level information such as I/O and networking activity.
| 46 | + Raw hardware performance counters can be analysed, but
| 47 | + its main goal is to abstract over them.
| 48 | +* [pmu-tools] : A collection of tools for profile collection
| 49 | + and performance analysis on Intel CPUs, built on top of Linux perf.
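
As an illustration of the lowest-level option in the list above, the following sketch reads one 64-bit MSR through the `msr` device file. It assumes a Linux system with the `msr` kernel module loaded and sufficient privileges. The helper name `read_msr` and its `device_template` parameter are our own, and register `0x10` (the time-stamp counter on x86) is used only as an example.

```python
import os
import struct

TSC_MSR = 0x10  # IA32_TIME_STAMP_COUNTER, used here only as an example


def read_msr(cpu, register, device_template="/dev/cpu/{cpu}/msr"):
    """Return the 64-bit value of `register` on logical CPU `cpu`.

    The msr device exposes each register at a file offset equal to the
    register address, so a single 8-byte positional read is enough.
    """
    fd = os.open(device_template.format(cpu=cpu), os.O_RDONLY)
    try:
        data = os.pread(fd, 8, register)
    finally:
        os.close(fd)
    if len(data) != 8:
        raise OSError(f"short read from MSR {register:#x}")
    return struct.unpack("<Q", data)[0]


if __name__ == "__main__":
    try:
        print(f"TSC on CPU 0: {read_msr(0, TSC_MSR)}")
    except OSError as err:  # typically: module not loaded or not root
        print(f"cannot read MSR: {err}")
```

The higher-level tools in the list hide this kind of access behind event names, which is usually preferable to hard-coding register addresses.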
| 50 | + |
| 51 | + |
| 52 | +`likwid` |
| 53 | +-------- |
| 54 | +Since the goal of this document is to analyze system behaviour
| 55 | +through performance counters and to provide insights into possible
| 56 | +new scheduling strategies for Hyper-Threading systems, we chose
| 57 | +the `likwid` applications and libraries for our task. The choice
| 58 | +was driven mainly by the useful benchmarks in the `likwid`
| 59 | +repository for stressing the FPU and other core subsystems. Additionally,
| 60 | +Intel VTune Amplifier was used to profile the benchmarks in order to
| 61 | +characterize their workload.
| 62 | + |
| 63 | +`likwid-perfctr -e` lists all the events available on the current
| 64 | +architecture, while `likwid-perfctr -a` shows the performance groups,
| 65 | +i.e. pre-configured event sets with useful pre-selected events and
| 66 | +derived metrics. Several performance-monitoring execution modes
| 67 | +are available, as documented in the `likwid` wiki. Of main interest are
| 68 | +**wrapper mode** and **timeline mode**. The former produces a summary of the
| 69 | +events, while the latter outputs performance metrics at a fixed
| 70 | +interval (specified through the `-t` flag).
| 71 | +When multiple groups need to be monitored, multiplexing is performed
| 72 | +at the granularity set through the `-t` flag in timeline mode (or
| 73 | +`-T` in wrapper mode), and the output consists of the id of the group read
| 74 | +at a given timestep and its values.
| 75 | + |
| 76 | + >Tests have shown that for measurements below 100 milliseconds, the |
| 77 | + periodically printed results are not valid results anymore (they are higher |
| 78 | + than expected) but the behavior of the results is still valid. E.g. if you |
| 79 | + try to resolve the burst memory transfers, you need results for small |
| 80 | + intervals. The memory bandwidth for each measurement may be higher than |
| 81 | + expected (could even be higher than the theoretical maximum of the machine) |
| 82 | + but the burst and non-burst traffic is clearly identifiable by highs and |
| 83 | + lows of the memory bandwidth results. |
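
To make the multiplexing of several groups more concrete, the sketch below simulates which performance group would be active at each timestep, assuming the groups are rotated in round-robin order at a fixed interval. The round-robin assumption, the function name and the group names in the example are ours, chosen for illustration. The code models the schedule only and does not call `likwid`.

```python
from itertools import cycle


def group_schedule(groups, interval_ms, duration_ms):
    """Return (timestamp_ms, group_id, group_name) for every timestep,
    rotating through `groups` in round-robin order."""
    if not groups:
        raise ValueError("at least one group is required")
    if interval_ms <= 0:
        raise ValueError("interval must be positive")
    schedule = []
    rotation = cycle(enumerate(groups))
    for timestamp in range(0, duration_ms, interval_ms):
        group_id, name = next(rotation)
        schedule.append((timestamp, group_id, name))
    return schedule


if __name__ == "__main__":
    # Hypothetical group names; switch every 500 ms for 2 s.
    for step in group_schedule(["GROUP_A", "GROUP_B", "GROUP_C"], 500, 2000):
        print(step)
```

With three groups, each one is observed for only about a third of the run, so the choice of interval trades coverage per group against the accuracy caveat quoted above for measurements below 100 milliseconds.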
| 84 | + |
| 85 | + |
| 86 | +Benchmarks |
| 87 | +---------- |
| 88 | + |
| 89 | +The benchmarks available in `likwid` can be run through the `likwid-bench`
| 90 | +command. For an overview of the available benchmarks, run `likwid-bench -a`.
| 91 | +All benchmarks perform operations over one-dimensional arrays. The benchmarks
| 92 | +used in our setting are:
| 93 | + |
| 94 | + * `ddot_sp`: Single-precision dot product of two vectors, only scalar |
| 95 | + operations |
| 96 | + * `copy`: Double-precision vector copy, only scalar operations |
| 97 | + * `ddot_sp_avx`: Single-precision dot product of two vectors, optimized for AVX |
| 98 | + * `sum_int`: Custom benchmark similar to `sum` but working on integers |
| 99 | + |
| 100 | +All benchmarks are run under multiple configurations that vary the number of
| 101 | +threads (with or without Hyper-Threading), the processor frequency (with TurboBoost
| 102 | +disabled) and the working set size. Varying the working set size lets us emulate
| 103 | +both *core-bound* executions (working set fitting in cache) and *memory-bound* ones.
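
The configuration space described above can be enumerated as a Cartesian product. The sketch below does this and builds a `likwid-bench` command line for each point, using the `-t <test>` and `-w <domain>:<size>:<threads>` options. The concrete thread counts, frequencies and working-set sizes are placeholders rather than the values used in our runs, and how the frequency is applied is not shown: it is only carried along with each command.

```python
from itertools import product

BENCHMARKS = ["ddot_sp", "copy", "ddot_sp_avx", "sum_int"]
THREADS = [1, 6, 12]            # placeholder thread counts
FREQUENCIES_GHZ = [1.0, 2.2]    # placeholder frequency points
WORKING_SETS = ["16kB", "1GB"]  # placeholder: cache-resident vs. memory-sized


def bench_command(test, threads, size, domain="S0"):
    """Build a likwid-bench invocation for one workgroup on `domain`."""
    return ["likwid-bench", "-t", test, "-w", f"{domain}:{size}:{threads}"]


def configurations():
    """Yield (frequency_ghz, command) for every point of the test matrix."""
    for test, threads, freq, size in product(
        BENCHMARKS, THREADS, FREQUENCIES_GHZ, WORKING_SETS
    ):
        yield freq, bench_command(test, threads, size)


if __name__ == "__main__":
    configs = list(configurations())
    print(f"{len(configs)} configurations")
    freq, command = configs[0]
    print(f"{freq} GHz: {' '.join(command)}")
```

Keeping the matrix in one place makes it easy to see how quickly the number of runs grows when a dimension is extended.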
| 104 | + |
| 105 | + |
| 106 | +Details |
| 107 | +------- |
| 108 | + |
| 109 | +The tests were run on a Dell XPS 9750 with an Intel i7-8750H. With TurboBoost disabled,
| 110 | +the available frequencies range from 1.0 to 2.2 GHz. There is one socket with
| 111 | +6 physical cores and 12 logical cores (with Hyper-Threading).
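
On Linux, the mapping between logical and physical cores can be inspected through sysfs, where each logical CPU lists its Hyper-Threading siblings in `topology/thread_siblings_list`. The following sketch groups logical CPUs by physical core. The function names are ours, and the exact numbering of siblings (for example `0,6` versus `0-1`) depends on the machine.

```python
import glob

SIBLINGS_GLOB = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as "0,6" or "0-1,4" into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def physical_cores(pattern=SIBLINGS_GLOB):
    """Return the distinct groups of sibling logical CPUs, one per core."""
    cores = set()
    for path in glob.glob(pattern):
        with open(path) as handle:
            cores.add(tuple(parse_cpu_list(handle.read())))
    return sorted(cores)


if __name__ == "__main__":
    for siblings in physical_cores():
        print("core with logical CPUs:", siblings)
```

On a machine like the one described here, one would expect six groups of two logical CPUs each when Hyper-Threading is enabled.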
| 112 | + |
| 113 | +[PAPI]: http://icl.utk.edu/papi/ |
| 114 | +[PAPI]: https://bitbucket.org/icl/papi.git |
| 115 | +[PAPI]: http://icl.utk.edu/projects/papi/wiki/PAPIC:Overview |
| 116 | + |
| 117 | +[likwid]: https://github.com/RRZE-HPC/likwid |
| 118 | + |
| 119 | +[Intel Vtune amplifier]: https://software.intel.com/en-us/vtune |
| 120 | + |
| 121 | +[perf]: http://www.brendangregg.com/perf.html |
| 122 | + |
| 123 | +[pmu-tools]: https://github.com/andikleen/pmu-tools |