-
Notifications
You must be signed in to change notification settings - Fork 0
/
prerequisites.tex
247 lines (206 loc) · 10.3 KB
/
prerequisites.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
\section{Prerequisites}
Before deep diving into performance analysis with {\em perf} two important
questions have to be answered before proceeding. The are two compiler
flags that make the life of a performance analyst difficult. Missing
debug information and the omitted frame pointer.
Why these two options are critical in the performance analyst point of
view is shown in the next two sections.
\subsection[ssec:dsymbol]{Debug Symbols}
A debug symbols is additional information stored in the binary or in
another file that expresses information about a programming language
construct (variable, function, ...) that is mapped to binary code.
A symbolic debugger can gain access to names of variable or function
names from the source code of the binary. Additionally the debugger
can inspect variables in memory, step into and step over these
language constructs in the source code with this additional
information.
\subsubsection{Where are debug symbols saved?}
For the sake of simplicity the focus will be on \infull{ELF} (\ELF\)
binaries. Every binary has several sections e.g. the {\tt .text} section
holds the executable code, {\tt .bss} section holds statically
allocated variables. Another interesting section is {\tt .eh_frame}
that in the first place held information about exception handling in
{\tt C++}. Nowadays this section is used for more then just
exceptions handling. Unwinding the stack, back tracing are just a few
examples how this section is used additionally.
Beware that even if C-code is compiled without exceptions the
{\tt .eh_frame} is still present and {\em needed} for reasons stated above.
To have a look at the {\tt .eh_frame} section one can use the {\em readelf}
program.
\margintext{DWARF-2 Call Frame Information, additionally one can use
{\tt dwarfdump} or {\tt objdump}}
\starttyping
# readelf --debug-dump=frames /bin/ls
Contents of the .eh_frame section:
00000000 0000000000000014 00000000 CIE
Version: 1
Augmentation: "zR"
Code alignment factor: 1
Data alignment factor: -8
Return address column: 14
Augmentation data: 1b
DW_CFA_def_cfa: r15 ofs 160
DW_CFA_nop
DW_CFA_nop
DW_CFA_nop
00000018 0000000000000014 0000001c FDE cie=00000000 pc=000000000000656
DW_CFA_nop
DW_CFA_nop
DW_CFA_nop
DW_CFA_nop
DW_CFA_nop
DW_CFA_nop
DW_CFA_nop
\stoptyping
The { \tt .eh_frame} has a \infull{CIE} (\CIE\) and for every frame a
\infull{FDE} (\FDE\). With this additional info a debugger can unwind
the stack even if a the frame pointer is omitted. See next section for
frame pointer issues.
Each section in a binary has additional flags that describe several
features of the corresponding section. Among these flags several of
the flags are {\tt ALLOC}, {\tt LOAD}, or {\tt READONLY}. The {\tt
ALLOC} and {\tt LOAD} flags indicate that this section is allocated
and loaded during runtime.
If source code is compiled with the {\tt -g} compiler flag additional
sections will be created. The default debugging format nowadays is the
\infull{DWARF} (\DWARF\) debugging format (dwarfstd.org). Inspecting
the binary with debug symbols one can see different \DWARF\ sections
that make up the \DWARF\ data. The next table shows five exemplary
\DWARF\ sections of the new binary.
\placetable[here][tab:dwarf]{Extract of some \DWARF\ sections of a binary}{
\starttabulate[|l|l|]
\HL
\NC {\sc Section} \NC {\sc Description} \NC\NR
\HL
\NC .debug_abbrev \NC Abbreviations used in the .debug_info section \NC\NR
\NC .debug_aranges \NC Lookup table for mapping addresses to compilation units \NC\NR
\NC .debug_frame \NC Call frame information \NC\NR
\NC .debug_info \NC Core DWARF information section \NC\NR
\NC .debug_line \NC Line number information \NC\NR
\HL
\stoptabulate
}
Examining the flags of these section one can notice that they are {\tt
READONLY} without the {\tt ALLOC} or {\tt LOAD} flag.
One may ask why such a thorough analysis was done about sections and
flags, and the answer is, when source code is compiled with debugging
information/symbols these debugging information are not loaded at
runtime nor the {\tt .data} or {\tt .text} sections are altered. Debugging
information will not alter the performance characteristics of a
workload. The downside is, the binary is bigger, hence you need extra
disk space to store this information.
To remove these sections one can use the Linux command {\tt strip}. It
removes the debug sections stated above and the result is a binary
that looks exactly the same as a binary where the {\tt -g} flag was
omitted.
Furthermore there is the myth that debug symbols are somehow
interwoven with executable code so that the execution is slowed
down. One can extract the {\tt .text} section and compare them to
see that they are the same independently if the source code
was compiled with or without the {\tt -g} flag.
\margintext{First compile a source file with debug and without debug symbols.
Next step is to extract the {\tt .text} section which holds the actual code
and build a hash sum.
}
\starttyping
# A is equal to B
# gcc -g -O2 A.c -o A && strip A
# gcc -O2 A.c -o B
# readelf -x .text A | md5sum
# readelf -x .text B | md5sum
f53e82c6da4f3c601fbf5c22e74e5729 -
f53e82c6da4f3c601fbf5c22e74e5729 -
\stoptyping
Just for educational purpose the size of the sections between {\tt A} and
{\tt B}, where evidently no important sections are altered through
debugging symbols and stripping of the binary is shown in Table \in{Table}[tab:elf].
\placetable[here][tab:elf]{Size of sections in an ELF binary}{
\starttabulate[|l|r|r|]
\HL
\NC {\sc Section} \NC {\tt size -A -d ./A} \NC {\tt size -A -d ./B} \NC\NR
\HL
\NC .interp \NC 15 \NC 15 \NC\NR
\NC .note.ABI-tag \NC 32 \NC 32 \NC\NR
\NC .note.gnu.build-id \NC 36 \NC 36 \NC\NR
\NC .gnu.hash \NC 52 \NC 52 \NC\NR
\NC .dynsym \NC 288 \NC 288 \NC\NR
\NC .dynstr \NC 167 \NC 167 \NC\NR
\NC .gnu.version \NC 24 \NC 24 \NC\NR
\NC .gnu.version_r \NC 32 \NC 32 \NC\NR
\NC .rela.dyn \NC 216 \NC 216 \NC\NR
\NC .rela.plt \NC 48 \NC 48 \NC\NR
\NC .init \NC 64 \NC 64 \NC\NR
\NC .plt \NC 96 \NC 96 \NC\NR
\NC .text \NC 464 \NC 464 \NC\NR
\NC .fini \NC 44 \NC 44 \NC\NR
\NC .rodata \NC 4 \NC 4 \NC\NR
\NC .eh_frame_hdr \NC 36 \NC 36 \NC\NR
\NC .eh_frame \NC 132 \NC 132 \NC\NR
\NC .init_array \NC 8 \NC 8 \NC\NR
\NC .fini_array \NC 8 \NC 8 \NC\NR
\NC .jcr \NC 8 \NC 8 \NC\NR
\NC .dynamic \NC 496 \NC 496 \NC\NR
\NC .got \NC 88 \NC 88 \NC\NR
\NC .data \NC 16 \NC 16 \NC\NR
\NC .bss \NC 8 \NC 8 \NC\NR
\NC .comment \NC 52 \NC 52 \NC\NR
\HL
\stoptabulate
}
\subsection[framep]{Frame Pointer}
Before going into detail what impact the frame pointer has on
performance and what issues arise when using {\em perf}, first a little
information how the frame pointer is used.
When a function is called a certain contiguous section of memory is
set aside for the program called the stack. The stack saves return
values, arguments to the called function and e.g. local variables
(depending on the architecture). The stack works the same as the stack
known from C or C++. Items are pushed or popped from the stack.
The local variables, values and arguments to the called function are
grouped into a stack frame that represents a function call. With every
new function call a new stack frame is allocated and the needed data
is pushed onto the stack. When the function exists the data is popped
off the stack. This stack of frames is called the {\em call-stack}.
The {\em call-stacks} are later used in \in[fgraph] \about[fgraph] to
visualize them.
The stack pointer always points the top of the stack and is used for
pushing and popping of elements to the stack. So depending on what is
done on the stack the stack pointer is moving around (up and
down). For simpler access of variables on the stack usually the frame
pointer is used. The frame pointer points to the beginning of the
stack, the return address where to jump back after completing the
current function and is not manipulated during the execution of a
function.
\startplacemarginfigure[ title={A sample stack that is compatible
with most architectures.}, reference=fig1]
\externalfigure[stack][marginwidth]
\stopplacemarginfigure
To address a variable on the stack an offset is added to the frame
pointer and the variable can be accessed through this pointer. It's an
auxiliary pointer to keep addressing simple, but this addressing of
things on the stack works with the stack pointer too, but it is more
{\em complicated}.
Compiling a source code with optimization ({\tt -O1} is enough) the
compiler will omit the frame pointer per default and will be using the
stack pointer to access local variables.
Omitting the frame pointer is performance enhancement where
theoretically one can save a memory write to a cached memory, a few
clock ticks in entry/exit of a function and a general purpose register
is freed up. The compiler can utilize the freed register to produce
code that is smaller and potentially faster.
The problem here is that a debugger will lose an easy way to generate
a stack trace. If frame pointers are not omitted they can be used to
walk the stack. The frame pointer is a linked list of stack frames.
The debugger might still be able to generate a stack trace (when frame
pointers are ommited) from a different source and that's when the
{\tt .eh_frame} section comes into play.
The debugger has to implement stack unwinding with the debug
information in the {\em .eh_frame} section that is generated whether less
{\tt -g} is provided as a compilation flag or not.
\subsubsection{Why is this important for {\em perf}?}
Call-stacks are a nice way to see who is called by whom and this
information can be saved by {\em perf} as well and the saved
events/samples can be correctly correlated to the specific function
calls in the call-stack. Even if the frame pointer is ommitted the
call stack has to be unwinded by {\em perf}, it has to be implemented
(architecture specific) in the same manner as for the debugger.