Commit e90fc0b

author
Angelogeb
committed
Added report
0 parents  commit e90fc0b

1 file changed

+123
-0
lines changed

report.md

@@ -0,0 +1,123 @@
% Performance counters analysis for Hyper-Threading
% Beatrice Bevilacqua, Anxhelo Xhebraj
% March 2019

Performance counters analysis for Hyper-Threading
=================================================


Performance Counters Frameworks
-------------------------------
The complexity of newer architectures makes a good knowledge of the
underlying hardware necessary in order to reach peak performance.
Following this trend, new interfaces such as Performance Monitoring
Units (PMUs) have been made available to developers for spotting
performance bottlenecks in their applications.

PMUs enable developers to observe and count events in the CPU such as
branch mispredictions, cache misses and other finer-grained details over
the whole pipeline. Although powerful, such information remains
burdensome to deal with given the diversity of the events, which makes
it difficult to truly identify optimization opportunities.
Depending on the processor family, on average 4 counters can be read
simultaneously at any time using Model Specific Registers. In order
to read more than 4 events, various tools multiplex such registers
in a *time-sharing* fashion.

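When registers are time-shared, each event is only counted during the
slices in which it owns a register, so its raw count has to be
extrapolated to the full measurement window. The following is a minimal
sketch of that extrapolation, assuming linear scaling by the ratio of
enabled time to running time (the convention used by Linux perf). The
function and parameter names are hypothetical and are not taken from any
of the tools discussed here.

```python
def scale_multiplexed_count(raw_count, time_enabled_ns, time_running_ns):
    """Estimate the full-window count of an event that owned a hardware
    counter for only part of the window (linear extrapolation)."""
    if time_running_ns == 0:
        # The event was never scheduled on a counter: nothing to scale.
        return 0.0
    return raw_count * time_enabled_ns / time_running_ns


# Hypothetical numbers: 8 events share 4 counters, so each event is
# counted for roughly half of a 100 ms window.
estimate = scale_multiplexed_count(
    raw_count=1_000_000,
    time_enabled_ns=100_000_000,
    time_running_ns=50_000_000,
)
print(estimate)
```
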
Many tools for performance analysis based on PMUs have been developed,
ranging from *raw* event counts to more sophisticated and aggregated
measures, as follows:

* `msr`: direct access to the device files `/dev/cpu/*/msr`
* [PAPI] : a Performance Application Programming Interface that
  offers a set of APIs for using performance counters.
  Supports multiple architectures and multiplexing.
* [likwid] : a suite of applications and libraries for analysing
  High Performance Computing applications. It contains
  out-of-the-box utilities to work with MPI, power profiling
  and architecture topology.
* [Intel VTune Amplifier] : an application for performance analysis
  on Intel architectures. It gives insights regarding possible
  bottlenecks of the application by annotating its source code and
  suggests possible solutions.
* [perf] : in a similar vein to Intel VTune Amplifier, it shows which
  functions are most critical to the application. Additionally, it
  provides higher-level information such as I/O and networking.
  It is possible to analyse raw hardware performance counters, but
  its main goal is abstracting over them.
* [pmu-tools] : a collection of tools for profile collection
  and performance analysis on Intel CPUs on top of Linux perf.


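To illustrate the lowest level in the list above, here is a hedged
sketch of reading one 64-bit register through the `msr` device file.
It assumes a Linux system with the `msr` kernel module loaded and
sufficient privileges (typically root). The helper name `read_msr` and
its `path_template` parameter are hypothetical. The register address is
used as the file offset; on Intel CPUs, for instance, address `0x10`
is the time-stamp counter, so `read_msr(0x10)` would return its current
value on CPU 0.

```python
import os
import struct


def read_msr(register, cpu=0, path_template="/dev/cpu/{cpu}/msr"):
    """Read one 64-bit model specific register of a given logical CPU.

    The msr device file is addressed by register number: reading 8 bytes
    at offset `register` yields the register value (little-endian on x86).
    """
    fd = os.open(path_template.format(cpu=cpu), os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, register)
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError("short read while accessing the register")
    return struct.unpack("<Q", raw)[0]
```
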
`likwid`
--------
Since the goal of this document is to analyze system behaviour
through performance counters in order to provide insights into
possible new scheduling strategies for Hyper-Threading systems, we
chose the `likwid` applications and libraries for our task. The choice
was driven mainly by the presence in the `likwid` repository of useful
benchmarks for stressing the FPU and other core subsystems. Additionally,
Intel VTune Amplifier was used to profile the benchmarks in order to
characterize their workload.

`likwid-perfctr -e` lists all the events available for the current
architecture, while `likwid-perfctr -a` shows the performance groups,
i.e. pre-configured event sets with derived metrics. Multiple modes of
execution of performance monitoring are available, as documented in the
`likwid` wiki. Of main interest are **wrapper mode** and **timeline
mode**. The former produces a summary of the events, while the latter
outputs performance metrics at a fixed interval (specified through the
`-t` flag).
When multiple groups need to be monitored, multiplexing is performed
at the granularity set through the `-t` flag in timeline mode (`-T` in
wrapper mode), and each output record contains the id of the group read
at a given timestep together with its values.

> Tests have shown that for measurements below 100 milliseconds, the
> periodically printed results are not valid results anymore (they are higher
> than expected) but the behavior of the results is still valid. E.g. if you
> try to resolve the burst memory transfers, you need results for small
> intervals. The memory bandwidth for each measurement may be higher than
> expected (could even be higher than the theoretical maximum of the machine)
> but the burst and non-burst traffic is clearly identifiable by highs and
> lows of the memory bandwidth results.


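Because multiplexed output interleaves the groups, a first
post-processing step is usually to rebuild one time series per group.
The sketch below assumes each sample has already been parsed into a
`(group_id, timestamp, values)` tuple. This record layout and the sample
numbers are hypothetical and do not reproduce the actual
`likwid-perfctr` output format.

```python
from collections import defaultdict


def split_by_group(samples):
    """Turn interleaved (group_id, timestamp, values) samples into
    one chronologically ordered time series per group id."""
    series = defaultdict(list)
    for group_id, timestamp, values in samples:
        series[group_id].append((timestamp, values))
    return dict(series)


# Hypothetical samples: two groups alternating every 0.5 seconds.
samples = [
    (1, 0.5, [120, 30]),
    (2, 1.0, [7]),
    (1, 1.5, [118, 29]),
    (2, 2.0, [9]),
]
per_group = split_by_group(samples)
print(per_group[1])
```
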
Benchmarks
----------

The benchmarks available in `likwid` can be run through the `likwid-bench`
command. For an overview of the available benchmarks run `likwid-bench -a`.
All benchmarks perform operations over one-dimensional arrays. The benchmarks
used in our setting are:

* `ddot_sp`: Single-precision dot product of two vectors, only scalar
  operations
* `copy`: Double-precision vector copy, only scalar operations
* `ddot_sp_avx`: Single-precision dot product of two vectors, optimized for AVX
* `sum_int`: Custom benchmark similar to `sum` but working on integers

All benchmarks are run under multiple configurations, varying the number of
threads (with or without Hyper-Threading), the processor frequency (with
TurboBoost disabled) and the working set size. Varying the working set size is
needed in order to emulate both *core-bound* executions (working set fitting
in cache) and *memory-bound* ones.


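The configuration space described above is a Cartesian product of the
varied parameters. A minimal sketch of enumerating it follows. The
function name, the dictionary keys and the concrete values are
hypothetical placeholders, except for the 1.0 and 2.2 GHz frequencies,
which are the endpoints of the available range with TurboBoost disabled.

```python
from itertools import product


def benchmark_configurations(thread_counts, frequencies_ghz, working_sets_kib):
    """Enumerate every combination of thread count, Hyper-Threading
    on/off, processor frequency and working set size."""
    return [
        {
            "threads": threads,
            "hyper_threading": hyper_threading,
            "frequency_ghz": frequency,
            "working_set_kib": working_set,
        }
        for threads, hyper_threading, frequency, working_set in product(
            thread_counts, (False, True), frequencies_ghz, working_sets_kib
        )
    ]


# Placeholder values: a small working set meant to fit in cache and a
# large one meant to spill to main memory.
configs = benchmark_configurations(
    thread_counts=[1, 2],
    frequencies_ghz=[1.0, 2.2],
    working_sets_kib=[16, 1024 * 1024],
)
print(len(configs))  # 2 * 2 * 2 * 2 = 16
```
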
Details
-------

The tests were run on a Dell XPS 9750 with an i7-8750H. With TurboBoost
disabled the available frequencies range from 1.0 to 2.2 GHz. There is one
socket with 6 physical cores and 12 logical cores (with Hyper-Threading).

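To decide which logical cores share a physical core, Linux exposes
per-CPU sibling lists in sysfs, for example in
`/sys/devices/system/cpu/cpu0/topology/thread_siblings_list`. The sketch
below parses the comma- and range-separated format used by such files.
The helper name is hypothetical and the sample strings are illustrative,
not measurements from the machine described above.

```python
from typing import List


def parse_cpu_list(text: str) -> List[int]:
    """Parse a sysfs CPU list such as "0,6" or "0-2,8" into CPU ids."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate empty input
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


# Illustrative input: logical CPUs 0 and 6 sharing one physical core.
print(parse_cpu_list("0,6\n"))
```
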
[PAPI]: http://icl.utk.edu/papi/
[PAPI]: https://bitbucket.org/icl/papi.git
[PAPI]: http://icl.utk.edu/projects/papi/wiki/PAPIC:Overview

[likwid]: https://github.com/RRZE-HPC/likwid

[Intel VTune Amplifier]: https://software.intel.com/en-us/vtune

[perf]: http://www.brendangregg.com/perf.html

[pmu-tools]: https://github.com/andikleen/pmu-tools
