\chapter{Using Deep Learning to Detect Technical Debt}
We aim to create a system that automatically detects technical debt in a class method. To achieve this goal, we took three steps: dataset creation, classification and hyperparameter tuning.
The dataset was created automatically by mining and processing the histories of open-source projects. It is balanced by construction and its two classes are referred to as SATD and fixed.
We initially identify technical debt through the presence of SATD in the comments of the source code, using keyword-label pattern matching \cite{potdar2014exploratory} \cite{rantala2020prevalence}.
The identification of a SATD/fixed method pair rests on a strong assumption: when a commit makes the SATD comment disappear, we assume that the technical debt has been fixed, so we regard the new code as TD-free and assign it to the fixed class.
% detail: describe the circumstances where we reject this hypotheses
The classifier is a neural network capable of representing snippets of code with a fixed-length vector, conceptually similar to how word2vec works. The learned vector representation (code embedding) is associated with a semantic label (i.e. SATD or fixed).
Lastly, to increase the performance of the system we select a set of hyperparameters and perform a distributed grid search for tuning their values. The rest of this chapter explains each part in greater detail.
%3.1
\section{Mining SATD Instances and their Fixes}
\label{sec:mining_satd}
These are the fundamental activities involved in the creation of the dataset:
\begin{itemize}
\item GitHub repository URL mining: using the GitHub API, we extracted the URLs of 248,872 projects matching our search profile.
\item Repository cloning and filtering: some projects can be discarded only after the commit history is available for a preliminary analysis that determines whether the project qualifies for the next activity.
\item Commit history processing: the directed acyclic graph of the commits is traversed and the source code is parsed in search of acceptable SATD/fixed pairs.
\end{itemize}
The following sections describe these activities in detail.
\subsection{GitHub repository URL mining}
We know from experience and from the literature that the concentration of SATD is very low \cite{bavota2016large} \cite{maldonado2015detecting} \cite{potdar2014exploratory}; for this reason, we built the dataset starting from an initial set of open-source projects that was as large as possible.
The search was conducted over all public Java GitHub repositories in a 20-year time window (from 2000 to 2019).
The retrieval process of this URL list was challenging in itself. We dealt with and solved the following issues:
\begin{enumerate}
\item The GitHub search API limit of 1,000 hits per search.
\item The GitHub search API quota on the number of requests per minute.
\item Filtering to remove low-quality repositories.
\item Unexpected interruptions.
\end{enumerate}
We addressed the first issue with two intertwined phases of queries: probe and harvest. A probe requests only the number of repositories created in a specific time window; if that number exceeds the 1,000 limit, the probe query is iteratively split into two sub-queries, each covering half of the time window, and both are added to the job queue.
It was not enough to specify the day interval in the query; the time of day was also needed because, in recent years, there are multiple instances of more than 1,000 repositories being created in a single day.
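A minimal Python sketch of this splitting logic follows; \texttt{count\_repositories} is a hypothetical placeholder for the probe query, and the real tool expresses the windows as GitHub search query strings rather than plain datetimes.
\begin{lstlisting}[language=Python]
from collections import deque
from datetime import datetime


def count_repositories(start: datetime, end: datetime) -> int:
    """Hypothetical probe: number of repositories created in [start, end]."""
    raise NotImplementedError  # in the real tool this is a GitHub search API call


def plan_harvest_windows(start: datetime, end: datetime, limit: int = 1000):
    """Split the time window until every sub-window holds at most `limit` repositories."""
    jobs = deque([(start, end)])
    harvest_windows = []
    while jobs:
        lo, hi = jobs.popleft()
        if count_repositories(lo, hi) <= limit:   # probe phase
            harvest_windows.append((lo, hi))      # safe to harvest completely
        else:                                     # too many hits: halve the window
            mid = lo + (hi - lo) / 2
            jobs.append((lo, mid))
            jobs.append((mid, hi))
    return harvest_windows
\end{lstlisting}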
The second issue was solved using an authentication token, as specified by the GitHub documentation.
The third issue was the reason we switched to the GitHub GraphQL API: we quickly realized that the total number of repositories in the selected period exceeded seven million, and we needed a way to trim the list down to the most meaningful ones. Through a trial-and-error process, we empirically defined a metric to keep the repositories with a good likelihood of containing \textit{higher quality source code}.
The GraphQL Repository API can be queried to return additional information; for each repository, we requested the issue count and the commit count so as to quickly discard those with 100 or fewer commits and 100 or fewer issues. It must be noted that the GraphQL query automatically excluded some repositories because of atypical structures (e.g. Subversion-to-git conversions and repositories with non-standard naming).
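As an illustration, the following sketch shows the kind of GraphQL request used for this filtering; the field names follow the public GitHub GraphQL schema, but the exact query and the threshold logic in the mining tool may differ slightly.
\begin{lstlisting}[language=Python]
import requests

GRAPHQL_QUERY = """
query($searchQuery: String!) {
  search(query: $searchQuery, type: REPOSITORY, first: 100) {
    nodes {
      ... on Repository {
        nameWithOwner
        createdAt
        issues { totalCount }
        defaultBranchRef { target { ... on Commit { history { totalCount } } } }
      }
    }
  }
}
"""


def fetch_candidates(token: str, search_query: str) -> list:
    """Return the repositories matching `search_query` that pass the count filter."""
    response = requests.post(
        "https://api.github.com/graphql",
        json={"query": GRAPHQL_QUERY, "variables": {"searchQuery": search_query}},
        headers={"Authorization": f"bearer {token}"},
    )
    response.raise_for_status()
    repos = response.json()["data"]["search"]["nodes"]
    # Keep repositories with more than 100 issues and more than 100 commits.
    return [
        repo for repo in repos
        if repo
        and repo["issues"]["totalCount"] > 100
        and repo["defaultBranchRef"]
        and repo["defaultBranchRef"]["target"]["history"]["totalCount"] > 100
    ]
\end{lstlisting}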
The last issue affects all long-running processes: unexpected interruptions such as network outages, program crashes and API service unavailability. To mitigate these problems, every failed query is retried up to 50 times; after that, the program halts with an exception.
The program can recover from a crash because every query is cached to a file; when a query is issued and its cache file is present, the network request is skipped and the file content is used instead. This was important because the URL mining run took roughly 150 hours, and a way to reuse previously expended resources was needed.
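A simplified sketch of this cache-and-retry scheme is shown below; \texttt{run\_query} stands in for the actual network call and the cache directory name is purely illustrative.
\begin{lstlisting}[language=Python]
import hashlib
import json
import os
import time

CACHE_DIR = "query_cache"  # illustrative location of the on-disk cache
MAX_RETRIES = 50


def cached_query(query_text: str, run_query) -> dict:
    """Answer `query_text` from the file cache, or run it with up to 50 retries."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    name = hashlib.sha1(query_text.encode("utf-8")).hexdigest() + ".json"
    cache_file = os.path.join(CACHE_DIR, name)
    if os.path.exists(cache_file):            # cache hit: skip the network entirely
        with open(cache_file) as f:
            return json.load(f)
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            result = run_query(query_text)
            with open(cache_file, "w") as f:  # persist for crash recovery
                json.dump(result, f)
            return result
        except Exception:
            if attempt == MAX_RETRIES:        # give up after the last failure
                raise
            time.sleep(attempt)               # simple linear back-off
\end{lstlisting}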
The URL mining tool creates two text files: one with the complete list and one with the URLs that match our acceptance criteria. Both files contain the following columns:
\begin{itemize}
\item Repository creation date.
\item User name and repository name.
\item Issue count.
\item Commit count.
\end{itemize}
The total number of URLs retrieved was roughly seven million, but we accepted only a subset of them: around 250 thousand repositories.
% expand on cache because it takes a few days to mine all URLs
\subsection{Repository cloning and filtering}
% 250,000 clones
% exclusion of Android OS repositories, 300,000 commits
% disk space and on-the-fly deletion
% the processing was done just after one repo was cloned
% description next section
This section describes the process of cloning 250 thousand repositories and applying further filtering.
The repository clone task was performed without a checkout, i.e. only the commit history was downloaded, without creating the working copy of the last commit. This approach saved disk space and processing time, particularly when the last commit contained a high number of files.
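Conceptually, this corresponds to a clone without a working tree; the sketch below illustrates the idea with the plain git command line, while the actual pipeline uses a Java git library, as described next.
\begin{lstlisting}[language=Python]
import subprocess


def clone_history_only(url: str, target_dir: str) -> None:
    """Clone only the commit history; a bare clone creates no working copy."""
    subprocess.run(["git", "clone", "--bare", url, target_dir], check=True)
\end{lstlisting}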
Once the repository was locally accessible, using the library JGit\footnote{\url{https://www.eclipse.org/jgit/}}, we tested for the presence of a few files that indicate an Android OS project; if they were found, the repository was rejected.
The problem with Android OS was that there were more than 1,700 forks/clones, and those repositories counted between 300,000 and 550,000 commits; it is one of the biggest projects encountered in this endeavour and it posed a significant bottleneck for the commit history processing: traversing the commit tree could take a couple of days for a single Android OS repository.
The exclusion of Android OS in this phase was established only for performance reasons.
The general issue of detecting forked (i.e. duplicated) repositories will be better explained in the next section.
\subsection{Commit history processing}
This section describes how we traverse all the commits of a repository and collect the code snippets for our SATD/fixed dataset.
What we are looking for is a method body that is affected by SATD and, after a commit, is no longer affected by it, i.e. the SATD was removed. It follows that we do not need to blindly parse all the Java files, but only those changed by a commit.
For each repository we iterate over every commit and every file change, but we only care about \textit{modify} operations; we ignore the other git file change operations: add, delete, rename and copy.
At this stage we have a pair of Java source code texts: the old version before the commit and the new one after it; they are also called the \textit{before image} and the \textit{after image}. Next, we parse the before image and run the SATD detector; if hits are found, we also parse the after image.
Now we revisit the previous explanation at a different abstraction level: not from the source file point of view, but at the method level.
What we have is a list of methods affected by SATD coming from the before image and another list of methods created from the after image; we pair the items of these two lists by method name and accept as viable candidates only those pairs whose after-image method is not affected by SATD. The following code implements the semantics described above (old stands for before image and new stands for after image):
\begin{center}
\makebox[\textwidth]{\includegraphics[width=0.7\columnwidth]{images/pairing_methods.png}}
\end{center}
There are a few considerations to ponder:
\begin{itemize}
\item The appearance order of the methods in the source code does not affect the process.
\item When a method is removed it is automatically excluded through the lack of pairing between the two lists.
\item The renamed methods are lost; they are treated as removed.
\end{itemize}
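For readability, a hedged Python re-implementation of the pairing step pictured above is reported below; the actual code operates on JavaParser method declarations, and the \texttt{has\_satd} predicate stands for the keyword detector described next.
\begin{lstlisting}[language=Python]
def pair_methods(old_satd_methods: dict, new_methods: dict, has_satd) -> list:
    """Pair SATD methods from the before image with their after-image version.

    Both dictionaries map method names to method source code; a pair is accepted
    only when the after-image version no longer contains SATD.
    """
    pairs = []
    for name, old_source in old_satd_methods.items():
        new_source = new_methods.get(name)            # removed/renamed: no match
        if new_source is not None and not has_satd(new_source):
            pairs.append((old_source, new_source))    # SATD gone: candidate pair
    return pairs
\end{lstlisting}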
\noindent The rest of this section develops some aspects that were not fully explained before but that are important to the creation of the dataset.
\\
\\
\textbf{Keyword label pattern matching.}
To detect SATD, we use keyword-label pattern matching with the 62 patterns reported by Potdar and Shihab \cite{potdar2014exploratory} (there are actually 63\footnote{\url{http://users.encs.concordia.ca/~eshihab/data/ICSME2014/satd.html}}). From these patterns we create regular expressions that are applied to all the source comments extracted with JavaParser.
We ran an initial experiment to verify the effectiveness of those patterns and excluded the following two: ``there is a problem'' and ``bail out''. After manually inspecting roughly one hundred matches, we noticed that these two patterns were often used as documentation inside Java `catch' blocks and were not documenting SATD. The final list used in our tool is found in listing \ref{lst:patterns61}.
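A minimal sketch of the matching step is shown below; only a handful of patterns are listed here, and the exact regular-expression construction (word boundaries, case handling) is an assumption about, not a copy of, the real detector.
\begin{lstlisting}[language=Python]
import re

# Illustrative subset of the SATD patterns; the real tool uses the full list of 61.
SATD_PATTERNS = ["hack", "fixme", "workaround", "kludge", "temporary solution"]
SATD_REGEXES = [re.compile(r"\b" + re.escape(p) + r"\b", re.IGNORECASE)
                for p in SATD_PATTERNS]


def comment_has_satd(comment_text: str) -> bool:
    """Return True when any SATD pattern occurs in the given comment."""
    return any(regex.search(comment_text) for regex in SATD_REGEXES)
\end{lstlisting}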
\textbf{Snippets preprocessing and cleaning.} \label{sec:cleaning} Every sample in the dataset contains the verbatim method source code pair (before image and after image). It also contains a new pair (called `clean') stripped of all comments and string literals. The string cleaning process replaces every non-null, non-empty string with a constant: ``--\#\#string\#\#--''. In other words, string values are constrained to this set: null, the empty string and ``--\#\#string\#\#--''.
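The cleaning itself is performed with JavaParser; purely as a rough illustration of the string replacement, a regex-based sketch is given below (it ignores corner cases such as literals inside comments).
\begin{lstlisting}[language=Python]
import re

STRING_PLACEHOLDER = '"--##string##--"'
# Matches non-empty double-quoted Java string literals, honouring escape sequences.
NON_EMPTY_STRING_LITERAL = re.compile(r'"(?:\\.|[^"\\])+"')


def clean_strings(java_source: str) -> str:
    """Replace every non-empty string literal with the constant placeholder."""
    return NON_EMPTY_STRING_LITERAL.sub(STRING_PLACEHOLDER, java_source)
\end{lstlisting}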
\textbf{Precomputed features.}
The neural model used in this thesis needs a specific representation for each code snippet (see section \ref{sec:code2vec_ast_paths} for details). The computation of such a representation is expensive, so we store it in the database alongside the sample (columns old\_clean\_features and new\_clean\_features).
\textbf{Handling forks and duplicated code.}
In our research we found that many repositories were duplicates or clones (GitHub keeps track of forks but does not track duplicates).
We need to be very careful not to introduce noise into the dataset. To ensure a higher quality for our samples, we implemented the following steps (a sketch of these checks is shown after the list):
\begin{itemize}
\item We concatenated the clean images (before and after images) and computed the hash of such a string. This hash is stored in a database field with a unique index on it; this ensures that we do not have duplicated pairs.
\item Two additional hashes are calculated: one for the clean before image and one for the clean after image. We then discard every pair whose before-image hash matches any after-image hash (or vice versa); this ensures that the same snippet cannot appear as both a before and an after image.
\item We applied the same process described in the previous point to the precomputed features.
\end{itemize}
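The following sketch summarizes these checks; in the real pipeline the first rule is enforced by a unique database index rather than by an in-memory set.
\begin{lstlisting}[language=Python]
import hashlib


def sha1(text: str) -> str:
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def accept_pair(clean_old: str, clean_new: str,
                seen_pairs: set, before_hashes: set, after_hashes: set) -> bool:
    """Apply the deduplication rules described above to one candidate pair."""
    pair_hash = sha1(clean_old + clean_new)
    old_hash, new_hash = sha1(clean_old), sha1(clean_new)
    if pair_hash in seen_pairs:                       # rule 1: no duplicated pairs
        return False
    if (old_hash == new_hash or old_hash in after_hashes
            or new_hash in before_hashes):            # rule 2: no snippet on both sides
        return False
    seen_pairs.add(pair_hash)
    before_hashes.add(old_hash)
    after_hashes.add(new_hash)
    return True
\end{lstlisting}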
\textbf{Rejected snippets.}
% see JavaMethodTest.kt
We implemented unit tests\footnote{\url{https://github.com/simonegiacomelli/code2vec-satd-classifier/blob/master/satd-classifier/src/jvmTest/kotlin/satd/step2/JavaMethodTest.kt}} to make sure that specific Java constructs that would pose issues for the pipeline are rejected. For example, methods containing inner named methods (explicit interface implementations) were discarded because the parser in code2vec recognized them as two distinct methods, which is not correct.
\section{The Deep Learning Model}
The model described in this section is called code2vec \cite{alon2019code2vec}.
The following three sequential processes can be considered a chain that progressively transforms the source code into the desired target (i.e. the two classes SATD/fixed):
\begin{itemize}
\item Decompose
\item Aggregate
\item Predict
\end{itemize}
\noindent Each of these steps can be viewed as a process that takes an input, creates an intermediate representation and generates an output for the next process.
In the rest of this section we give some details about each one of them and introduce some technical terms.
\\
\\
\noindent \textbf{Decompose.} The input of this process is the source code; the output is a bag of path-contexts. The following list gives more detail on the process and the intermediate representations:
\begin{itemize}
\item Parsing and creation of the abstract syntax tree (AST) of the method source code.
\item Extraction of all AST-paths (up to a fixed limit).
\item Encoding of the AST-paths into a bag of context-vectors.
\item Transformation of each context-vector into a path-context (so as to obtain a bag of path-contexts).
\end{itemize}
\noindent \textbf{Aggregate.} This process aggregates the bag of path-contexts (the output of the previous process) using an attention vector. The final result is a code-vector that represents the snippet of code as a continuous distributed vector, i.e. a `code embedding'.
\\
\\
\noindent \textbf{Predict.} The code-vector is fed to a fully connected neural network that performs the classification using the desired classes (i.e. SATD/fixed).
\\
\\
\noindent In the following sections we expand and dig deeper into these concepts.
% Here I want to explain briefly what are the fundamental pieces and how they interact (without pretending that the reader really understands, but to give the general idea)
\subsection{Representing code using AST-paths} \label{sec:code2vec_ast_paths}
This section describes how to capture semantic information from code snippets and create a representation that can be used to predict properties of the snippet, for example a label (e.g. SATD/fixed).
To better illustrate how the decomposition is done, we use the simple code snippet in listing \ref{lst:snippet_ast_code} as an example.
%AST of code snippet
\begin{lstlisting}[caption={Example code to show decomposition}, label={lst:snippet_ast_code},language=Java]
String METHOD_NAME() {
if(somePreCondition())
while(!completed())
doWork();
return "ok";
}
\end{lstlisting}
Using JavaParser\footnote{\url{https://javaparser.org/}} the AST is extracted from the source code; see an example in figure \ref{fig:AST_graphviz}.
\begin{figure}
\centering
\resizebox{\columnwidth}{!}{
\includesvg[]{images/AST_graphviz}
}
\caption[AST for listing \ref{lst:snippet_ast_code}.]{Abstract syntax tree of the source code presented in listing \ref{lst:snippet_ast_code}.}
\label{fig:AST_graphviz}
\end{figure}
We identify three sets of node types in the AST: value, terminal and non-terminal nodes. Values are the leaves (this set is identified with $X$), terminal nodes are the immediate parents of the leaves, and the rest are the non-terminal ones.
\textit{AST-path definition}: an AST-path is a path connecting two terminal nodes; it must include one non-terminal node which is a common ancestor of the two terminal nodes.
The representation of the program is the \textit{set of all its AST-paths}.
To wrap up this section, we explain the last AST-path in figure \ref{fig:AST_paths} and, for convenience, this AST-path is also reported here:
%AST-paths only last
\begin{center}
\makebox[\textwidth]{\includegraphics[width=\columnwidth]{images/AST_paths_last.png}}
\end{center}
The two value nodes are the first and the last words, `dowork' and `ok' respectively; the reader can also identify these leaf nodes in the bottom-right part of figure \ref{fig:AST_graphviz}.
The central part of the AST-path above is the connecting path between those two value nodes: it lists all the intermediate nodes and it also specifies the direction (up or down) one needs to take to traverse the tree. The common ancestor, for this example, is the intermediate node called `block'.
%AST-paths of code snippet
\begin{figure}
\centering
\resizebox{\columnwidth}{!}{
\includegraphics{images/AST_paths.png}
}
\caption[All AST-paths for listing \ref{lst:snippet_ast_code}.]{AST-paths of listing \ref{lst:snippet_ast_code}.}
\label{fig:AST_paths}
\end{figure}
% here I go into details on the first decomposition/representation of the code snippets into fixed-length vectors
What is seen in figure \ref{fig:AST_paths} is a bag of context-paths (another name for the set of all AST-paths) and it is the representation for the code snippet. The following section explains how these AST-paths are transformed into fixed-length vectors.
\subsection{Context-vector}
%how many OOV?
%https://github.com/tech-srl/code2vec/blob/c98e8f786b7262e56c93e520d039fb7aa5d0f7ef/vocabularies.py#L123
A context-vector $c_i$ is the vector representation of an AST-path. This transformation is applied to all AST-paths, producing a bag of context-vectors.
The following picture shows how the context-vector is formed:
\begin{center}
\makebox[\textwidth]{\includegraphics[width=\columnwidth]{images/context_vector.png}}
\end{center}
To explain the picture above we need to introduce two matrices:
\begin{align*}
value\_vocab \in \mathbb{R}^{|X| \times d}
\\
path\_vocab \in \mathbb{R}^{|P| \times d}
\end{align*}
% \begin{itemize}
% \item value\_vocab $\in \mathbb{R}^{|X| \times d}$
% \item path\_vocab $\in \mathbb{R}^{|P| \times d}$
% \end{itemize}
The embedding size $d$ is a hyperparameter. $X$ is the set of values of the AST terminals that were observed during training; in our recurring example this set is composed of: string, METHOD\_NAME, someprecondition, completed, dowork and ok.
$P$ is the set of all AST-paths across all snippets.
These matrices are initialized randomly and are learned by the model during the training.
An embedding (either from value\_vocab or path\_vocab) is looked up by selecting the appropriate row of its matrix.
The previous notions tell us that:
\begin{align*}
c_i \in \mathbb{R}^{3 d}
\end{align*}
The two matrices value\_vocab and path\_vocab do not need to have the same width $d$, but for convenience they do.
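To make the lookup concrete, the following numpy sketch builds $c_i$ for the running example; the toy vocabularies and the random initialization are purely illustrative, since in the model both matrices are learned.
\begin{lstlisting}[language=Python]
import numpy as np

# Toy vocabularies from the running example; real ones cover the whole training set.
value_index = {"string": 0, "METHOD_NAME": 1, "someprecondition": 2,
               "completed": 3, "dowork": 4, "ok": 5}
path_index = {"example-path": 0}   # real keys encode entire AST-paths

d = 112                            # embedding size (hyperparameter)
rng = np.random.default_rng(0)
value_vocab = rng.normal(size=(len(value_index), d))   # |X| x d, learned
path_vocab = rng.normal(size=(len(path_index), d))     # |P| x d, learned


def context_vector(x_s: str, path: str, x_t: str) -> np.ndarray:
    """c_i = [value_vocab[x_s] ; path_vocab[p] ; value_vocab[x_t]], of size 3d."""
    return np.concatenate([value_vocab[value_index[x_s]],
                           path_vocab[path_index[path]],
                           value_vocab[value_index[x_t]]])
\end{lstlisting}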
\subsection{Path-context}
The previous section defined the context-vector $c_i$. Applying a fully connected layer to it, we obtain the path-context vector $\widetilde{c}_i$, also called the combined context-vector. The following equation describes the computation of this layer:
\begin{align*}
\widetilde{c}_i = \tanh( W \cdot c_i )
\end{align*}
where $W$ is the weight matrix and
\begin{align*}
W \in \mathbb{R}^{d \times 3 d}
\end{align*}
For convenience, the height of $W$ is the same as before ($d$); it does not strictly need to be so and could be different.
Another way to look at this layer is that it compresses the context-vector $c_i$ of size $3d$ into a combined context-vector of size $d$.
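A short numpy sketch of this layer, with a randomly initialized $W$ standing in for the learned weights:
\begin{lstlisting}[language=Python]
import numpy as np

d = 112
rng = np.random.default_rng(0)
W = rng.normal(size=(d, 3 * d))        # weight matrix W, learned during training


def path_context(c_i: np.ndarray) -> np.ndarray:
    """Combined context-vector: tanh(W . c_i), compressing 3d values down to d."""
    return np.tanh(W @ c_i)
\end{lstlisting}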
\subsection{Attention mechanism and the code-vector} \label{sec:attention_code_vector}
The previous section left us with a set of path-contexts. The goal of this part of the model is to combine them all into a code-vector.
This step employs an attention vector $a$ (see figure \ref{fig:code2vec_arch}); it can be described as a weighted average. The weights are initialized randomly and learned with respect to the bag of path-contexts.
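The weighted average can be sketched as follows; the attention vector is randomly initialized here, whereas in the model it is learned.
\begin{lstlisting}[language=Python]
import numpy as np

d = 112
rng = np.random.default_rng(0)
a = rng.normal(size=d)                  # attention vector, learned during training


def code_vector(path_contexts: np.ndarray) -> np.ndarray:
    """Aggregate an (n, d) matrix of path-contexts into a single code-vector."""
    scores = path_contexts @ a                    # one score per path-context
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax -> attention weights
    return weights @ path_contexts                # weighted average of size d
\end{lstlisting}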
\begin{figure}
\centering
\resizebox{\columnwidth}{!}{
\includegraphics{images/code2vec_arch.png}
}
\caption[]{Code2vec architecture. Alon et al. \cite{alon2019code2vec}}
\label{fig:code2vec_arch}
\end{figure}
\subsection{Training and prediction}
% detailed explanations in step:
% - representation idea (ast, path learning)
% - semantic labeling
% - SATD/fixed
% The model used is called code2vec. The idea is to represents source code snippets using a fixed-length vector i.e. code embedding.
% The abstract syntax tree (AST) is used, there is an attention mechanism and it was successful in predicting semantic labeling.
% Simple and fast
The code-vector is used as input for a binary classifier. During the training, the model learns how to classify two classes: SATD and fixed.
In the prediction phase, it calculates the probability that a specific class should be assigned to the given method body.
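The classification layer can be sketched as a softmax over the two classes; the weight matrix shown here is a random stand-in for the parameters learned during training.
\begin{lstlisting}[language=Python]
import numpy as np

d = 112
rng = np.random.default_rng(0)
CLASSES = ["SATD", "fixed"]
class_weights = rng.normal(size=(len(CLASSES), d))   # learned in the real model


def predict(code_vec: np.ndarray) -> dict:
    """Return the probability of each class for a given code-vector."""
    logits = class_weights @ code_vec
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                              # softmax
    return dict(zip(CLASSES, probs))
\end{lstlisting}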
%3.3
\section{Hyperparameter Tuning}
This section describes the process of tuning the hyperparameters of the model.
The hyperparameter tuning was performed on a subset of the mined dataset. This subset is composed of 106,272 instances (53,136 pairs) and contains only methods with fewer than 200 tokens. The constraint was mainly due to the high cost of training on longer methods; an example of a method with 199 tokens is shown in listing \ref{lst:snippet199}.
%The initial experiments led us to quickly understand that using shorter snippets yielded better performances. For example, a dataset with snippets shorter than 25 tokens has 4,256 samples and gives a 73\% test accuracy. A dataset with snippets shorter than 200 tokens has 106,272 samples and gives a 58\% test accuracy.
%Using a smaller set of the dataset decreases the usefulness of the trained model because of less capability in generalization. For the tuning of the hyperparameters, we empirically defined them to keep those snippets with less than 200 tokens;
\begin{lstlisting}[caption={Code snippet with 199 tokens}, label={lst:snippet199},language=Java]
private CompoundWorkflow finishCompoundWorkflow(
WorkflowEventQueue queue,
CompoundWorkflow compoundWorkflow,
String taskOutcomeLabelId,
String userTaskComment,
boolean finishOnRegisterDocument,
List<NodeRef> excludedNodeRefs) {
if ((finishOnRegisterDocument &&
compoundWorkflow.isStatus(Status.FINISHED)) ||
(!finishOnRegisterDocument &&
checkCompoundWorkflow(compoundWorkflow,
Status.IN_PROGRESS,
Status.FINISHED) == Status.FINISHED)) {
if (log.isDebugEnabled()) {
log.debug("--##string##--" + compoundWorkflow);
}
} else {
setWorkflowsAndTasksFinished(queue, compoundWorkflow,
taskOutcomeLabelId, userTaskComment,
finishOnRegisterDocument, excludedNodeRefs);
if (finishOnRegisterDocument || excludedNodeRefs != null) {
stepAndCheck(queue, compoundWorkflow);
} else {
stepAndCheck(queue, compoundWorkflow, Status.FINISHED);
}
boolean changed = saveCompoundWorkflow(queue,
compoundWorkflow, null);
if (log.isDebugEnabled()) {
log.debug("--##string##--" + compoundWorkflow);
}
}
CompoundWorkflow freshCompoundWorkflow =
getCompoundWorkflow(compoundWorkflow.getNodeRef());
if (!finishOnRegisterDocument && excludedNodeRefs == null) {
checkCompoundWorkflow(freshCompoundWorkflow, Status.FINISHED);
}
checkActiveResponsibleAssignmentTasks(
freshCompoundWorkflow.getParent());
return freshCompoundWorkflow;
}
\end{lstlisting}
\noindent We based the tuning operation on these three hyperparameters:
\begin{itemize}
\item \textit{default\_embeddings\_size:} this value defines the length of the code vector, i.e. the vector representation of the snippet (see section \ref{sec:attention_code_vector}).
\item \textit{max\_contexts:} the maximum number of AST-paths used by the model (see section \ref{sec:code2vec_ast_paths}).
\item \textit{dropout\_keep\_rate:} dropout randomly removes neurons to prevent excessive adaptation to the training data and, in doing so, reduces the likelihood of the network overfitting.
\end{itemize}
To explore the best values for these parameters we created a distributed experiment using Google Colab Pro and a tool called Optuna.
\textbf{Optuna} is defined as a ``next-generation hyperparameter optimization framework''. It is capable of constructing the parameter search space dynamically and implements both searching and pruning strategies \cite{optuna_2019}. The initial experiments were conducted with a competing tool, Hyperopt \cite{bergstra2013making}, but it was abandoned in favor of Optuna.
The Optuna distributed worker was easier to set up, and the distributed experiment involved less friction than with Hyperopt. Figure \ref{fig:optuna_objective} shows a simplified version of the objective function required by Optuna, where we define the search space for each hyperparameter. Our objective value is the test prediction accuracy.
\begin{figure}[hbt!]
\centering
\resizebox{\columnwidth}{!}{
\includegraphics{images/optuna_objective.png}
}
\caption[]{Simplified version of the objective function - optuna\_worker.py}
\label{fig:optuna_objective}
\end{figure}
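Since the figure is an image, an indicative Optuna objective of this shape is reproduced below; the search ranges, the storage URL and the \texttt{train\_and\_evaluate} helper are illustrative assumptions, while the actual script is \texttt{optuna\_worker.py}.
\begin{lstlisting}[language=Python]
import optuna


def train_and_evaluate(embeddings_size, max_contexts, dropout_keep_rate) -> float:
    """Hypothetical stand-in: train code2vec and return the test accuracy."""
    raise NotImplementedError


def objective(trial: optuna.Trial) -> float:
    # Search space (ranges are indicative, not the exact ones used in the thesis).
    embeddings_size = trial.suggest_int("default_embeddings_size", 64, 256)
    max_contexts = trial.suggest_int("max_contexts", 100, 400)
    dropout_keep_rate = trial.suggest_float("dropout_keep_rate", 0.1, 1.0)
    # The objective value is the test prediction accuracy.
    return train_and_evaluate(embeddings_size, max_contexts, dropout_keep_rate)


if __name__ == "__main__":
    # All workers share the same database, so the search runs distributed.
    study = optuna.create_study(study_name="satd", direction="maximize",
                                storage="postgresql://user:password@host/optuna",
                                load_if_exists=True)
    study.optimize(objective, n_trials=20)
\end{lstlisting}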
The first rounds of experiments yielded good values for the \textit{max\_contexts} parameter: we found that the optimal value lies between 250 and 300. The other two parameters, \textit{default\_embeddings\_size} and \textit{dropout\_keep\_rate}, needed a different search space: figure \ref{fig:optuna_first} clearly shows the first two value bubbles crushed against the left edge. This suggested a new experiment with the different search window shown in figure \ref{fig:optuna_second}.
\begin{figure}
\centering
\resizebox{\columnwidth}{!}{
\includegraphics{images/optuna_first.png}
}
\caption[]{First experiment, grid search slice plot}
\label{fig:optuna_first}
\end{figure}
\begin{figure}
\centering
\resizebox{\columnwidth}{!}{
\includegraphics{images/optuna_second.png}
}
\caption[]{Slice plot for second grid search experiment with a centered search space}
\label{fig:optuna_second}
\end{figure}
\textbf{Google Colab Pro.} The distributed experiment was executed on the Google Colab infrastructure. We used 12 concurrent sessions, with both GPU and CPU runtimes. The Python notebook contained only a few lines of code (see listing \ref{lst:colab}). The first operation was to clone the GitHub repository with the source code for the distributed experiment. The second was to invoke a Python function that sets up the Colab environment: download the PostgreSQL binaries, download and restore the backup with the dataset, and install all the code2vec libraries and dependencies. The third and last step starts the Optuna worker. The results were stored in a central database, as per Optuna's design.
\begin{lstlisting}[caption={Google Colab notebook code}, label={lst:colab},language=Python]
!cd /content; cd code2vec-satd-classifier && git pull || git clone \\
https://github.com/simonegiacomelli/code2vec-satd-classifier
%cd /content/code2vec-satd-classifier/code2vec-satd
import satd_colab_starter as starter; starter.main()
!python3 optuna_worker.py
\end{lstlisting}
% best satd-6 trial_id=37906,37915 0.615% accuracy
\noindent The best hyperparameters configuration found was the following:
\begin{itemize}
\item default\_embeddings\_size: 112
\item max\_contexts: 290
\item dropout\_keep\_rate: 0.2
\end{itemize}
This setup raised the accuracy from 58\% (using default values) to 61.5\%.
\begin{figure}
\centering
\resizebox{\columnwidth}{!}{
\includegraphics{images/optuna_history.png}
}
\caption[]{Second experiment, grid search optimization history plot}
\label{fig:optuna_history}
\end{figure}
%code2vec model reported by keras
% 2020-12-27 16:18:22,016 INFO __________________________________________________________________________________________________
% 2020-12-27 16:18:22,016 INFO Layer (type) Output Shape Param # Connected to
% 2020-12-27 16:18:22,016 INFO ==================================================================================================
% 2020-12-27 16:18:22,016 INFO input_1 (InputLayer) [(None, 200)] 0
% 2020-12-27 16:18:22,017 INFO __________________________________________________________________________________________________
% 2020-12-27 16:18:22,017 INFO input_2 (InputLayer) [(None, 200)] 0
% 2020-12-27 16:18:22,017 INFO __________________________________________________________________________________________________
% 2020-12-27 16:18:22,017 INFO input_3 (InputLayer) [(None, 200)] 0
% 2020-12-27 16:18:22,017 INFO __________________________________________________________________________________________________
% 2020-12-27 16:18:22,017 INFO token_embedding (Embedding) (None, 200, 112) 43483664 input_1[0][0]
% 2020-12-27 16:18:22,017 INFO input_3[0][0]
% 2020-12-27 16:18:22,017 INFO __________________________________________________________________________________________________
% 2020-12-27 16:18:22,017 INFO path_embedding (Embedding) (None, 200, 112) 94047408 input_2[0][0]
% 2020-12-27 16:18:22,017 INFO __________________________________________________________________________________________________
% 2020-12-27 16:18:22,017 INFO concatenate (Concatenate) (None, 200, 336) 0 token_embedding[0][0]
% 2020-12-27 16:18:22,017 INFO path_embedding[0][0]
% 2020-12-27 16:18:22,017 INFO token_embedding[1][0]
% 2020-12-27 16:18:22,017 INFO __________________________________________________________________________________________________
% 2020-12-27 16:18:22,017 INFO dropout (Dropout) (None, 200, 336) 0 concatenate[0][0]
% 2020-12-27 16:18:22,017 INFO __________________________________________________________________________________________________
% 2020-12-27 16:18:22,017 INFO time_distributed (TimeDistribut (None, 200, 336) 112896 dropout[0][0]
% 2020-12-27 16:18:22,017 INFO __________________________________________________________________________________________________
% 2020-12-27 16:18:22,017 INFO input_4 (InputLayer) [(None, 200)] 0
% 2020-12-27 16:18:22,017 INFO __________________________________________________________________________________________________
% 2020-12-27 16:18:22,018 INFO attention (AttentionLayer) ((None, 336), (None, 336 time_distributed[0][0]
% 2020-12-27 16:18:22,018 INFO input_4[0][0]
% 2020-12-27 16:18:22,018 INFO __________________________________________________________________________________________________
% 2020-12-27 16:18:22,018 INFO target_index (Dense) (None, 3) 1008 attention[0][0]
% 2020-12-27 16:18:22,018 INFO ==================================================================================================
% 2020-12-27 16:18:22,018 INFO Total params: 137,645,312
% 2020-12-27 16:18:22,018 INFO Trainable params: 137,645,312
% 2020-12-27 16:18:22,018 INFO Non-trainable params: 0
% 2020-12-27 16:18:22,018 INFO __________________________________________________________________________________________________