predictive/eval.tex at master · omertripp/predictive · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
\section{Evaluation}~\label{sec:eval}
Our evaluation focuses on the effectiveness  and scalability.
To measure the effectiveness, we compare with the predictive analysis,
{\sf RV}~\cite{pldi14}. We chose RV for two main reasons: (1) {\sf RV} is the
only open source predictive analysis, (2) {\sf RV} represents the
state-of-the-art, which is theoretically proven to have higher detection
capability than other approaches.

{\bf Prototype Implementation}  We use the Z3 SMT solver for constraint
resolution \cite{MouraB08}. For static discovery of unexecuted
branches, we utilize the SOOT compiler framework \cite{Vallee-RaiCGHLS99}.

For scalability, as well as to constrain memory footprint, we have
applied several engineering optimizations. In specific, (i) events are
stored in a database rather than in an in-memory buffer; and (ii) the
analysis is split into several segments, between which we restart
the VM and resume the last snapshot, such that the different logs
are fused together at the end.

%Besides, we terminate the monitoring run with a shutdown hook if it is beyond 30 minutes.

To improve the scalability in constraint solving, we again apply
a splitting optimization. The full trace is split into $N$ equal-length
subtraces or ``time windows'' (in our experiments, $N=1,000$). The values
of local and shared variables are then passed at the end of a
given window as the inputs to the next window. For lack of space, we
omit the details, and instead refer the reader to
Huang et al. \cite{pldi14}, who apply a similar optimization.


{\bf Evaluation Method} We conducted our experiments on a set of 16
applications, which were also used to evaluate {\sf RV}. Specifically,
the set includes large applications such as {\sf Jigsaw},
{\sf Xalan} and {\sf Lusearch}. To handle large applications that make
use of reflection, we applied the {\sf Tamiflex} tool~\cite{tamiflex}, which
replaces reflective calls with the concrete method invocations recorded
in the observed run.
% for the purpose of effective static analysis.
We omit the benchmark {\sf eclipse} because the current version
of {\sf Tamiflex}  leads to abnormal execution after the instrumentation,
which throws an exception when the main service is started.
By applying \tool\  to the abnormal execution, we identify only 3 races,
similarly to the report of RV~\cite{pldi14}.
In addition, the {\sf montecarlo} benchmark requires an external input
that we downloaded from the internet and simplified. For the large
applications, we utilized the most lightweight available configuration.


As the detection capability  depends on the observed run, for fair
comparison we monitor the execution once (for a period of 10 minutes) by recording all necessary
information required by both approaches, and then apply both techniques
to the monitored run. Besides, our reported data for RV may be different
from the original report for two reasons: (1) the different observed runs
lead to different sets of races, (2) the original implementation of RV
contains a bug in branch identification, which leads to  misses of
many branches and  incorrect reduction of dependences during the analysis.
We confirmed the bug with the author and fixed the bug in our
experiments.\footnote{We have created an in-depth report on the bug,
along with a witness test case. (See:  \url{https://sites.google.com/site/recipe3141/}.)}


Our experiments and measurements were all conducted on an x86 64
Thinkpad W530 workstation with eight 2.30GHz Intel Core i7-3610QM,
16GB of RAM and 6M cache. The workstation runs Ubuntu 12.04,
% of the Ubuntu Linux distribution,
and has the Sun 64-Bit 1.6.0\_26 JVM installed.
%\footnote{We note that Sun JDK 1.7 does not support the transformation of dacapo applications with {\sf tamiflex}.}


%Specifically, we use two threads in the monitored run as two threads are sufficient to manifest most of races~\cite{shanlu}.


\subsection{Effectiveness}
Table~\ref{tab:main} shows the main results of our analysis, which includes four sections: Trace (details about the trace), Races (detected races), Difference (comparison with RV) and Running time (the time taken by the analysis).


The Trace section includes the number of threads ($Th$), the number of
shared reads ($Reads$) and shared writes ($Writes$), the subset of shared
reads that read the base object references  ($Base$), where the  base object
references are references used as the base/target in the following field
reference or method invocation,  the number of branches ($Br$), the
synchronization events ($Sync$) and the total number of events ($Total$),
which includes local accesses in addition to the aforementioned events.
Specifically, branch events are reported in the form $A/B$, where $A$ refers
to the number of branches used by our analysis and $B$ refers to the number
of branches used by RV. To ensure that the predicted run sees the same
base object at each shared read/write, {\sf RV} inserts an artificial
branch immediately in front of each shared access (and array accesses).
We do not use such artificial branches.


We make a few interesting observations about the Trace section.
The non-local events, i.e., all the events listed in Table~\ref{tab:main},
occupy around 30\% of the total trace in the first 7 benchmarks,
but occupy less than 1\% of the trace in the rest of the benchmarks,
which have relatively more complex logic. Reads of the shared base objects
occupy a small portion ($\frac{1}{10}$-$\frac{1}{3}$) of the total
shared reads. The rest of the shared reads read only primitive values
or references not involved in field accesses (e.g., references involved
in nullness checking). Besides, our analysis involves significantly less branch events as compared to the {\sf RV} approach. This difference plays an important role in the detection, which we will explain shortly.


The $Races$ section shows the number of races detected by {\sf Recipe-s},
i.e., {\sf Recipe} without exploring un-executed paths; {\sf Recipe}, i.e., the full-fledged version; and {\sf RV}. Comparison between {\sf Recipe-s}
and {\sf Recipe} reveals that {\sf Recipe} finds $>$100 more races,
which demonstrates the strength of exploring  un-executed paths.
Intuitively,  {\sf Recipe-s} predicts based on a single trace, while
{\sf Recipe} predicts based on multiple traces containing different
execution paths. We also compare between {\sf Recipe} and {\sf RV} version,
as shown in the $Difference$ section, where $Diff$ shows the races found
by {\sf Recipe} but missed by {\sf RV} while $Diff'$ shows the races
found by {\sf RV} but not {\sf Recipe}.

First of all, {\sf Recipe} finds $>$250 more races than {\sf RV}. There
are several reasons for that: (1) {\sf Recipe} can reason about the
accesses in the un-executed paths, while {\sf RV} and {\sf Recipe-s} can
only reason about the accesses in the executed paths. (2) {\sf Recipe}
or {\sf Recipe-s} relax the scheduling even if it
%breaks the inter-thread read/write dependence structure
changes variable values in the observed run. {\sf Recipe} allows the
majority of the shared reads, i.e., the reads of primitive values,
to freely read a different value from a different write
as long as the value leads to the same branch decisions. {\sf RV}, however,
requires them to read the same values as in the observed run. (3) A
critical optimization proposed by {\sf RV} is to preserve dependences
only for the reads before the preceding branches of the racy events,
rather than all the reads. However, this optimization is underplayed
by the fact that {\sf RV} introduces huge amount of artificial
branches, i.e., one before each shared field access, which ensures
the use of the same base objects.
We get rid of such artificial branches and instead, rely on the small
amount of base read events to ensure the use of the same base
objects (Section~\ref{sec:relax1} and Section~\ref{sec:relax2}). Our strategy reduces the number of branches
greatly and amplifies the effectiveness of the optimization. We also
carried out case studies (Section~\ref{sec:cases}) to better illustrate the scenarios.


Another interesting observation is that although {\sf Recipe} should
produce all races found by {\sf RV} in theory, {\sf Recipe} may miss
races found by {\sf RV} in practice (Column $Diff'$), i.e., {\sf Recipe}
is not strictly more effective than {\sf RV} in practice. The reason  is
the limitations of the underlying constraint solver: (1) the solver cannot
compute constraints with very complex arithmetic operations, and (2) the
solver does not support some program constants such as the scientific
notation $3E-10$. In our empirical evaluation, we encountered only one
benchmark, {\sf account}, where these issues affect our detection capability.

The last section, Running time, compares the analysis time for both
approaches. We find our approach is significantly slower than {\sf RV},
e.g., {\sf RV} often finishes within 200 seconds, while our approach may
take more than 1 hour. This is because {\sf Recipe} has a larger search
space and needs to reason about the computation among variables inside
the local access events, while {\sf RV} needs to only reason about
the order relations among the events.


\begin{table*}[htbp]
\caption{Relaxed Analysis}
\begin{tabular}{|l|r|r|r|r|l|r|r|r|r|r|r|r|r|r|}
\hline
 & \multicolumn{ 7}{c|}{Trace} & \multicolumn{ 3}{c|}{Races} & \multicolumn{ 2}{c|}{Difference} & \multicolumn{ 2}{c|}{Running time (sec)} \\ \hline
Benchmarks & \multicolumn{1}{l|}{Th} & \multicolumn{1}{l|}{Reads} & \multicolumn{1}{l|}{Writes} & \multicolumn{1}{l|}{Base} & Br & \multicolumn{1}{l|}{sync} & \multicolumn{1}{l|}{Total} & \multicolumn{1}{l|}{RV} & \multicolumn{1}{l|}{Recipe-s} & \multicolumn{1}{l|}{Recipe} & \multicolumn{1}{l|}{Diff} & \multicolumn{1}{l|}{Diff'} & \multicolumn{1}{l|}{Recipe} & \multicolumn{1}{l|}{RV} \\ \hline
critical & 3 & 13 & 7 & 5 & 2/13 & 6 & 78 & 8 & 8 & 8 & 0 & 0 & 8 & 2 \\ \hline
airline & 11 & 45 & 15 & 4 & 32/82 & 30 & 317 & 9 & 9 & 9 & 0 & 0 & 490 & 4 \\ \hline
account & 3 & 46 & 21 & 10 & 3/47 & 6 & 227 & 5 & 2 & 5 & 3 & 3 & 41 & 4 \\ \hline
pingpong & 4 & 7 & 7 & 3 & 0/15 & 6 & 111 & 1 & 1 & 1 & 0 & 0 & 19 & 1 \\ \hline
bbuffer & 4 & 640 & 118 & 10 & 634/1.1K & 217 & 3.3K & 9 & 13 & 25 & 21 & 5 & 62 & 5 \\ \hline
bubblesort & 26 & 1.3k & 966 & 121 & 155/2.8K & 322 & 8.4K & 7 & 7 & 7 & 0 & 0 & 3295 & 3 \\ \hline
bufwriter & 5 & 165 & 52 & 75 & 16/130 & 44 & 525 & 2 & 4 & 10 & 8 & 0 & 63 & 9 \\ \hline
mergesort & 5 & 38 & 33 & 5 & 15/472 & 28 & 1.7K & 3 & 3 & 10 & 7 & 0 & 37 & 5 \\ \hline
raytracer & 2 & 31 & 5 & 12 & 314/8,2K & 676 & 94.5K & 4 & 4 & 6 & 2 & 0 & 47 & 2 \\ \hline
montecarlo & 2 & 5 & 86 & 2 & 1.9K/38.2K & 21.1K & 1.9M & 1 & 1 & 4 & 3 & 0 & 1 & 17 \\ \hline
moldyn & 2 & 605 & 61 & 104 & 19.6K/52.6K & 62 & 203.4K & 2 & 6 & 14 & 12 & 0 & 2842 & 1 \\ \hline
ftpserver & 28 & 684 & 299 & 71 & 4.4K/233.3K & 78.2K & 3.9M & 57 & 99 & 152 & 108 & 13 & 811 & 153 \\ \hline
jigsaw & 12 & 525 & 702 & 211 & 63.2K/467.9K & 86.7K & 5.5M & 8 & 17 & 23 & 15 & 0 & 33 & 7 \\ \hline
sunflow & 9 & 2.1K & 1.3K & 473 & 201.3K/827.0K & 50K & 7.1M & 20 & 38 & 78 & 69 & 11 & 4520 & 22 \\ \hline
xalan & 9 & 1.4K & 0.9K & 209 & 15.7K/103.2K & 190.1K & 6.6M & 2 & 2 & 6 & 4 & 0 & 5317 & 10 \\ \hline
lusearch & 10 & 2.3K & 0.5K & 715 & 22.2K/164K & 93.2K & 9.1M & 14 & 27 & 49 & 38 & 3 & 5430 & 8 \\ \hline
\end{tabular}
\label{tab:main}
\vspace{-2em}
\end{table*}


%\begin{table*}[htbp]
%\caption{Relaxed Analysis}
%\begin{flushleft}
%\begin{tabular}{|l|r|r|r|r|l|r|r|r|r|r|r|r|r|r|}
%\hline
% & \multicolumn{ 7}{c|}{Trace} & \multicolumn{ 3}{c|}{Races} & \multicolumn{ 2}{c|}{Difference} & \multicolumn{ 2}{c|}{Running time (sec)} \\ \hline
%Benchmarks & \multicolumn{1}{l|}{Th} & \multicolumn{1}{l|}{Reads} & \multicolumn{1}{l|}{Writes} & \multicolumn{1}{l|}{Base} & Br & \multicolumn{1}{l|}{sync} & \multicolumn{1}{l|}{Total} & \multicolumn{1}{l|}{Recipe-s} & \multicolumn{1}{l|}{Recipe} & \multicolumn{1}{l|}{RV} & \multicolumn{1}{l|}{Diff} & \multicolumn{1}{l|}{Diff'} & \multicolumn{1}{l|}{Recipe} & \multicolumn{1}{l|}{RV} \\ \hline
%critical & 3 & 13 & 7 & 5 & 2/13 & 6 & 78 & 8 & 8 & 8 & \multicolumn{1}{c|}{0} & 0 & 8 & \multicolumn{1}{c|}{2} \\ \hline
%airline & 11 & 45 & 15 & 4 & 32/82 & 30 & 317 & 9 & 9 & 9 & 0 & 0 & 490 & 4 \\ \hline
%account & 3 & 46 & 21 & 10 & 3/47 & 6 & 227 & 2 & 5 & 5 & 3 & 3 & 41 & 4 \\ \hline
%pingpong & 4 & 7 & 7 & 3 & 0/15 & 6 & 111 & 1 & 1 & 1 & 0 & 0 & 19 & 1 \\ \hline
%bbuffer & 4 & 640 & 118 & 10 & 634/1.1K & 217 & 3.3K & 13 & 25 & 9 & 21 & 5 & 62 & 5 \\ \hline
%bubblesort & 26 & 1.3K & 966 & 121 & 155/2.8K & 322 & 8.4K & 7 & 7 & 7 & 0 & 0 & 3295 & 3 \\ \hline
%bufwriter & 5 & 165 & 52 & 75 & 16/130 & 44 & 525 & 4 & 10 & 2 & 8 & 0 & 63 & 9 \\ \hline
%mergesort & 5 & 38 & 33 & 5 & 15/472 & 28 & 1.7K & 3 & 10 & 3 & 10 & 0 & 37 & 5 \\ \hline
%raytracer & 2 & 31 & 5 & 12 & 314/8,2K & 676 & 94.5K & 4 & 6 & 4 & 2 & 0 & 47 & 2 \\ \hline
%montecarlo & 2 & 5 & 86 & 2 & 1.9K/38.2K & 21.1K & 1.9M & 1 & 4 & 1 & 3 & 0 & 1 & 17 \\ \hline
%moldyn & 2 & 605 & 61 & 104 & 19.6K/52.6K & 62 & 203.4K & 6 & 14 & 2 & 12 & 0 & 2842 & 1 \\ \hline
%ftpserver & 28 & 684 & 299 & 71 & 4.4K/233.3K & 78.2K & 3.9M & 99 & 152 & 57 & 108 & 13 & 811 & 153 \\ \hline
%jigsaw & 12 & 525 & 702 & 211 & 63.2K/467.9K & 86.7K & 5.5M & 17 & 23 & 8 & 15 & 0 & 33 & 7 \\ \hline
%sunflow & 9 & \multicolumn{1}{l|}{2.1K} & \multicolumn{1}{l|}{1.3K} & 473 & 201.3K/827.0K & 50K & \multicolumn{1}{l|}{7.1M} & 38 & 78 & 20 & 69 & 11 & 4520 & 22 \\ \hline
%xalan & 9 & \multicolumn{1}{l|}{1.4K} & \multicolumn{1}{l|}{0.9K} & 209 & 15.7K/103.2K & \multicolumn{1}{l|}{190.1K} & \multicolumn{1}{l|}{6.6M} & 2 & 6 & 2 & 4 & 0 & 5317 & 10 \\ \hline
%lusearch & 10 & \multicolumn{1}{l|}{2.3K} & \multicolumn{1}{l|}{0.5K} & 715 & 22.2K/164K & \multicolumn{1}{l|}{93.2K} & \multicolumn{1}{l|}{9.1M} & 27 & 49 & 14 & 38 & 5 & 5430 & 8 \\ \hline
%\end{tabular}
%\end{flushleft}
%\label{tab:main}
%\end{table*}


\subsection{Case studies}~\label{sec:cases}
We manually inspected the reported races over small applications to gain better understanding about our approach.

{\bf Relaxing Values\ } Figure~\ref{fig:relax1} demonstrates the relaxation of inter-thread dependence enabled by our approach.
The code is from the benchmark {\sf bbuffer}, where the line numbers are
marked. Huang et al.~\cite{pldi14} detect the race between lines 291 and 400,
but fail to detect the race between lines 294 and 400. The reason is
that in the observed run, the execution follows the order 400, 291 and 294.
For the event at line 294, its preceding branch at line 291 reads
from line 400. Therefore, Huang et al.~\cite{pldi14}
require the predicted run to preserve the dependence between line 291 and
line 400, so that line 291 reads exactly the same value and the branching
expression evaluates to the same truth value. The dependence enforces
the order constraint, $400 \rightarrow 291$, which further enforces
the order $400\rightarrow 291 \rightarrow 294$.    Our approach does not
require the existence of such dependence. Specifically, we allow line
291 to happen before line 400 in the predicted run as long as the value
read by it leads to the same branch decision, which is true in this case.
As a result, there is no order constraint between line 294 and
line 400, and the two form a race. We were able to replay this race
easily using the eclipse IDE breakpoints.


\begin{figure}[htp]
\centering
\includegraphics[scale=0.45]{figs/Visio-bbuffer.pdf}
\caption{Relaxation of Inter-thread dependence}\label{fig:relax1}
\vspace{-1em}
\end{figure}

{\bf Relaxing the Paths\ } Figure~\ref{fig:relax2} demonstrates how our
approach relaxes the scheduling to account for  unexecuted paths in
{\tt mergesort}.
All threads invoke the method {\sf Sorting}, which recursively starts
two children threads if there are two more available entries in the
thread pool (lines 8-11), or starts one child thread if there is only
one available entry (lines 4-5).  The availability is computed through
the static method {\sf available} with the constant {\tt total}, which is equal to 5. The shared variable {\tt alive} denotes the used entries.


Initially there is one sorting thread. After it starts Thread 1 and
Thread 2, there are three threads alive and only two more
entries are available. Suppose that in the observed run Thread 2
consumes both thread entries and starts the child Thread 3
and Thread 4 (not shown), updating {\tt alive} to 5. Then Thread 1
cannot execute the branch at lines 4-5. Huang et al.~\cite{pldi14} require the predicted run to preserve the dependence denoted by the dotted line since this dependence affects the branch condition at line 2. As a result, the predicted run follows the same branch decision and cannot reason about the code at lines 4-5. Our analysis does not suffer from this limitation. Instead, it allows the predicted run to reason about the un-explored code. Specifically, it does not need to preserve the dependence from line 11 to line 1, and it allows line 1 (Thread 1) to read from line 9 (Thread 2). As a result, the branch condition guarding the unexplored branch is evaluated to be true, enabling the unexplored path in the constraint solver. Finally, the solver identifies the race between line 5 (Thread 1) and line 1 (Thread 3). Note that the two lines are synchronized on different locks~\footnote{We abbreviate the {\tt synchronized} keyword as {\tt sync}.
}.


\begin{figure}[htp]
\centering
\includegraphics[scale=0.45]{figs/Visio-msort.pdf}
\caption{Relaxation of Paths}\label{fig:relax2}
\vspace{-2em}
\end{figure}


\section{Threats to Validity}
A Java method can consist of up to 65535 bytecode instructions. Therefore,
 we count the number of bytecode instructions inside each method. If the
resulting number exceeds 65535, we avoid instrumenting the method.
The consequence is that we will potentially miss races inside a method for
which we skipped instrumentation. In this case, we specify the variables
read from the method to be equal to their concrete values in the constraints.
The constraint solver can proceed safely without being affected by
such methods.

\tool\  lacks support for certain operators. In our current
implementation, we do not support bit operations such as $\&$ and $<<$, which accounts for most of our misses.