%% REPLACE sXXXXXXX with your student number
\def\studentNumber{s1803764}
%% START of YOUR ANSWERS
%% Add answers to the questions below, by replacing the text inside the brackets {} for \youranswer{ "Text to be replaced with your answer." }.
%
% Do not delete the commands for adding figures and tables. Instead fill in the missing values with your experiment results, and replace the images with your own respective figures.
%
% You can generally delete the placeholder text, such as for example the text "Question Figure 3 - Replace the images ..."
%
% There are 5 TEXT QUESTIONS. Replace the text inside the brackets of the command \youranswer with your answer to the question.
%
% There are also 3 "questions" to replace some placeholder FIGURES with your own, and 1 "question" asking you to fill in the missing entries in the TABLE provided.
%
% NOTE! that questions are ordered by the order of appearance of their answers in the text, and not necessarily by the order you should tackle them. You should attempt to fill in the TABLE and FIGURES before discussing the results presented there.
%
% NOTE! If for some reason you do not manage to produce results for some FIGURES and the TABLE, then you can get partial marks by discussing your expectations of the results in the relevant TEXT QUESTIONS. The TABLE specifically has enough information in it already for you to draw meaningful conclusions.
%
% Please refer to the coursework specification for more details.
%% - - - - - - - - - - - - TEXT QUESTIONS - - - - - - - - - - - -
%% Question 1:
\newcommand{\questionOne} {
\youranswer{
We would expect the VGG38 model to perform better than the VGG08 model because of the large increase in the number of layers; however, this is not the case. Figure \ref{fig:acc_curves} shows that the train and validation accuracies of the VGG38 model stagnate at close to 0\% throughout training, whereas the VGG08 accuracies increase with the number of epochs in a roughly logarithmic fashion, reaching around 49\% validation accuracy in the last few epochs. These results make it evident that something is broken in the training process of the VGG38 model. Comparing figures \ref{fig:grad_flow_08} and \ref{fig:grad_flow_38} makes the cause clear: the VGG38 model suffers from the vanishing gradient problem. The gradient flow of the healthy VGG08 model and that of the unhealthy VGG38 model are extremely different; unlike in VGG08, the gradients in VGG38 are 0 (or very close to 0) in almost all layers throughout all epochs, a classic symptom of the vanishing gradient problem.
In figure \ref{fig:grad_flow_38} the majority of layers have gradients at (or very close to) 0, with a spike in gradient magnitude only in the last two layers. This pattern is characteristic of the problem: backpropagation shrinks the gradients more and more as it moves from the last layer towards the first, so in a 38-layer network without any countermeasures the lower layers receive exponentially small gradients and their weights barely move from their initial values, while the final couple of layers still receive gradients of appreciable (and highly variable) magnitude \cite{bohra_2021}. Because the lower layers never learn a meaningful transformation of the input, the output layers have nothing useful to fit, which explains why the model's accuracy never improves.
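To make this multiplicative shrinkage explicit, writing $\mathbf{a}^{(l)}$ for the activations of layer $l$, $L$ for the number of layers and $\mathcal{L}$ for the loss, the chain rule gives
\[
\frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(1)}}
= \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(L)}}
\frac{\partial \mathbf{a}^{(L)}}{\partial \mathbf{a}^{(L-1)}}
\cdots
\frac{\partial \mathbf{a}^{(2)}}{\partial \mathbf{a}^{(1)}},
\]
so if the norms of the per-layer Jacobians are typically smaller than one, the gradient magnitude decays roughly exponentially with depth, which affects the 38-layer network far more severely than the 8-layer one.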
}
}
%% Question 2:
\newcommand{\questionTwo} {
\youranswer{
%Question 2 - Describe Batch Normalization (BN) by using equations. Consider both training and test time and explain how BN addresses the vanishing gradient problem. Note that you are not required to provide the derivation of gradients w.r.t. weights for BN weights.
%The average length of an answer to this question would be around 2/3 of a column in a 2-column page
Batch normalisation is a technique for training very deep neural networks that standardises the inputs to a layer over each mini-batch. It allows much higher learning rates to be used, makes training less sensitive to the weight initialisation, and reduces (and in some cases removes) the need for regularisation techniques such as Dropout, since it has a mild regularising effect of its own \cite{machinelearningmastery_2019}.
Batch normalisation uses mini-batch statistics to normalise the activations of each layer. It adds an operation to the model just before or after the activation function of each hidden layer. This operation zero-centres and normalises each input, then scales and shifts the result using two new learned parameter vectors per layer: one for scaling and one for shifting. In other words, it lets the model learn the optimal scale and mean of each layer's inputs. To zero-centre and normalise the inputs, the algorithm estimates each input's mean and standard deviation over the current mini-batch, hence the name ``Batch Normalisation'' \cite{ioffe2015batch}.
\underline{The Batch Normalizing Transform algorithm}\\
Consider a mini-batch $\mathcal{B}$ of size $m$. Since the normalisation is applied to each activation independently, let us focus on a particular activation $a^{(k)}$ and omit $k$ for clarity. The mini-batch provides $m$ values of this activation, one per example:
\[\mathcal{B} = \{a_1, \dots, a_m\} \text{ }\text{ where }\text{ } a_i = \mathbf{w}^\top \mathbf{x}_i\]
\begin{enumerate}
\item Compute the mean and variance of $\mathcal{B}$\\
$\mu_\mathcal{B} = \frac{1}{m} \sum_{i=1}^{m} a_i$
\quad and \quad
$\sigma_\mathcal{B}^2 = \frac{1}{m}\sum_{i=1}^{m} (a_i - \mu_\mathcal{B})^2$
\item Normalise all the values in the mini-batch\\
$\hat a_i = \frac{a_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}$\\
where $\epsilon$ is a very small constant used for numerical stability
\item Shift and scale all the normalised values in the mini-batch using the learned parameters $\gamma$ and $\beta$ (one pair per activation)\\
$z_i = \gamma \hat a_i + \beta = \text{batchNorm}_{\gamma,\beta}(a_i)$
\end{enumerate}
The scale and shift parameters $\gamma$ and $\beta$ are learned by gradient descent jointly with the rest of the network's weights. In addition, during training the layer keeps exponential moving averages of the mini-batch means and variances (giving more importance to the latest iterations); these running estimates are stored for use at test time.
How, then, do we evaluate this model? Unlike during training, we may not have a full mini-batch to feed into the model at test time, and a prediction for a single example should not depend on whichever other examples happen to share its batch. To address this, the mini-batch statistics are replaced at test time by the running estimates of the mean and variance accumulated during training, and each activation is normalised with these fixed statistics before being scaled and shifted by the learned $\gamma$ and $\beta$ \cite{johann_huber_2020}.
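Concretely, writing $\hat\mu$ and $\hat\sigma^2$ for the running estimates of the mini-batch mean and variance kept during training, the transform applied to an activation $a$ at test time is
\[
\hat a = \frac{a - \hat\mu}{\sqrt{\hat\sigma^2 + \epsilon}}, \qquad z = \gamma \hat a + \beta,
\]
which is a fixed affine transformation of $a$ and therefore deterministic for any single test example.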
Batch normalisation addresses the vanishing gradient problem by reducing internal covariate shift: the change in the distribution of each layer's inputs during training as the parameters of the preceding layers change. This shift makes very deep networks with saturating non-linearities notoriously hard to train, forcing lower learning rates and careful parameter initialisation. By keeping each layer's inputs zero-centred and unit-scaled, batch normalisation prevents the activations from drifting into the saturated regions of the non-linearities, so the gradients flowing backwards retain a usable magnitude even through many layers and higher learning rates can be used safely.
}
}
%% Question 3:
\newcommand{\questionThree} {
\youranswer{
%Question 3 - Describe Residual Connections (RC) by using equations. Consider both training and test time and explain how RC address the vanishing gradient problem.
%The average length of an answer to this question would be around 1/2 of a column in a 2-column page
Residual connections (or skip connections) allow gradients to flow through a network directly, without passing through the stacked non-linear layers. In a very deep stack, repeated multiplication by the layers' Jacobians (weight matrices and activation derivatives) can make the gradients vanish or explode. In practice residual connections form a ``bus'' that runs the whole way through the network: each block of layers taps the values at a point along the bus and then adds its own values back onto it, so the blocks still affect the gradients and the forward outputs, but there is always a direct path that carries gradient through the entire network \cite{rnns_2018}.
The principal idea behind residual connections is that it is easier for a deep neural network to optimise a residual mapping of the input than the original, unreferenced mapping \cite{he2016deep}. The residual mapping can then be used to recover the desired underlying mapping of the input:\\
Let $H(x)$ be the desired underlying mapping. We let the stacked non-linear layers in our network fit another mapping, $F(x) = H(x) - x$. The original mapping is then recast as $F(x) + x$, which is equivalent to $H(x)$.
\emph{Note that since we perform an addition across different layers, the dimensions of $F(x)$ and $x$ must match; this is typically ensured by placing each residual connection within a single block of convolutional layers whose input and output dimensions agree.}
To implement this technique during training we simply create a connection from the activation output of layer $n$ to the activation input of layer $n+2$. This lets us retain the output of layer $n$, skip the non-linear transformation of layer $n+1$, and add the retained output to the input of layer $n+2$, thereby forming the residual mapping. This is repeated all the way down the network.
Because the residual connections are part of the forward computation itself, no changes to the model are needed at evaluation time: the same additive skip paths used during training are applied again at test time, so the model treats inputs in exactly the same way in both phases.
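The benefit for gradient flow can be seen directly from the equations above: differentiating $H(x) = F(x) + x$ with respect to $x$ gives
\[
\frac{\partial H(x)}{\partial x} = \frac{\partial F(x)}{\partial x} + I,
\]
so every residual block contributes an identity term to the backward pass in addition to the (possibly very small) Jacobian of its stacked layers. The gradient arriving at a block's output therefore always has a direct, unattenuated path back to the block's input, which prevents the multiplicative shrinkage observed in the plain VGG38 model.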
}
}
%% Question 4:
\newcommand{\questionFour} {
\youranswer{
%Question 4 - Present and discuss the experiment results (all of the results and not just the ones you had to fill in) in Table 1 and Figures 4 and 5 (you may use any of the other Figures if you think they are relevant to your analysis). You will have to determine what data are relevant to the discussion, and what information can be extracted from it. Also, discuss what further experiments you would have ran on any combination of VGG08, VGG38, BN, RC in order to
%\begin{itemize}
% \item Improve performance of the model trained (explain why you expect your suggested experiments will help with this).
% \item Learn more about the behaviour of BN and RC (explain what you are trying to learn and how).
%\end{itemize}
%The average length for an answer to this question is approximately 1 of the columns in a 2-column page
\begin{table*}[t]
\centering
\begin{tabular}{lr|ccccc}
\toprule
Model & LR & \# Params & Val loss & Val acc & Test loss & Test acc \\
\midrule
VGG38 & 1e-3 & 336 K & 4.61 & 0.64 & 4.61 & 1\\
VGG38 BN+RC & 1e-2 & 339 K & 1.79 & 61.8 & 1.83 & 60.37 \\
\bottomrule
\end{tabular}
\caption{Validation and test results for the original VGG38 model and the best performing model, VGG38 BN+RC.}
\label{tab:q4_final_scores}
\end{table*}
From the results in tables \ref{tab:CIFAR_results} and \ref{tab:q4_final_scores} it is evident that combining batch normalisation with residual connections is an effective remedy for the vanishing/exploding gradient problem: VGG38 BN+RC (LR 1e-2) is the best performing model, reaching 61.8\% validation accuracy and 60.37\% test accuracy.
This is further reinforced by comparing the gradient flow of the \emph{working} VGG08 model in figure \ref{fig:grad_flow_08} with that of the VGG38 BN+RC model in figure \ref{fig:grad_flow_bestModel}. The two gradient-flow graphs have very similar shapes, with high gradient variability in all layers throughout the epochs (illustrated by the thick band of blue lines) and spikes in gradient magnitude at the input and output layers. Conversely, recalling the comparison of the VGG08 and VGG38 gradient flows in section 2, there is a large disparity between the gradient flow of the plain VGG38 model in figure \ref{fig:grad_flow_38} and that of the VGG38 BN+RC model. The close similarity between the two working models (VGG08 and VGG38 BN+RC), together with the large disparity between the two VGG38 variants, is a direct indication that batch normalisation and residual connections fix the vanishing/exploding gradient problem in this deep network.
In future experiments on the VGG08 and VGG38 models, I would first experiment further with the configuration of batch normalisation and residual connections in VGG38, to better understand how well these techniques address the vanishing/exploding gradient problem in different settings. For batch normalisation I would vary the placement of the normalisation layer with respect to the activation function for different types of activation, and for residual connections I would try combinations of connections with different skip sizes.
Using the best configuration of batch normalisation and residual connections found in those experiments, the VGG38 BN+RC model could then be tuned further by optimising its remaining hyperparameters: the number of epochs, the batch size, the number of filters, the number of blocks per stage, the number of stages, and the random seed used for weight initialisation. To find a good solution these parameters need to be optimised jointly, which is computationally expensive because every parameter combination requires training the whole network. To keep the number of training runs manageable I would use Bayesian optimisation with a Gaussian-process surrogate: it models the performance of the parameter settings evaluated so far and chooses the next setting by maximising an acquisition function, such as the probability of improving on the best result seen so far.
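As a concrete example of such an acquisition rule (probability of improvement is just one common choice), the next hyperparameter setting $x$ would be chosen to maximise
\[
\mathrm{PI}(x) = \Phi\!\left(\frac{\mu(x) - f(x^{+}) - \xi}{\sigma(x)}\right),
\]
where $\mu(x)$ and $\sigma(x)$ are the Gaussian-process posterior mean and standard deviation at $x$, $f(x^{+})$ is the best validation accuracy observed so far, $\xi \ge 0$ is a small exploration margin, and $\Phi$ is the standard normal cumulative distribution function.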
}
}
%% Question 5:
\newcommand{\questionFive} {
\youranswer{
%Question 5 - Briefly draw your conclusions based on the results from the previous sections (what are the take-away messages?) and conclude your report with a recommendation for future work.
%Good recommendations for future work also draw on the broader literature (the papers already referenced are good starting points). Great recommendations for future work are not just incremental (an example of an incremental suggestion would be: "we could also train with different learning rates") but instead also identify meaningful questions or, in other words, questions with answers that might be somewhat more generally applicable.
%For example, \citep{huang2017densely} end with \begin{quote}``Because of their compact internal representations and reduced feature redundancy, DenseNets may be good feature extractors for various computer vision tasks that build on convolutional features, e.g., [4,5].''\end{quote}
%while \cite{bengio1993problem} state in their conclusions that \begin{quote}``There remains theoretical questions to be considered, such as whether the problem with simple gradient descent discussed in this paper would be observed with chaotic attractors that are not hyperbolic.\\\end{quote}
%The length of this question description is indicative of the average length of a conclusion section
In conclusion, it is evident that when training very deep neural networks both batch normalisation and residual connections are effective techniques for addressing the vanishing/exploding gradient problem. They enable our deep models to outperform the shallower ones and thus avoid the degradation problem, as shown by the 59.37\% absolute test accuracy improvement of the VGG38 BN+RC model over the original VGG38 model.
In the future, to further understand the vanishing/exploding gradient problem I would perform further experiments on the VGG38 model, varying the configuration of batch normalisation and residual connections and trying alternative techniques known to mitigate the problem (both in combination and on their own). This would allow us to compare the effectiveness of the various techniques and their relevance to different domains of neural-network classification. These alternative techniques include modified optimisation algorithms (such as RMSProp and gradient clipping), modified hidden-unit transfer functions (such as the LSTM), and careful weight initialisation schemes (such as Xavier initialisation).
Firstly, to tune the batch normalisation configuration further I would optimise the placement of the batch normalisation layers in the network. Batch normalisation may be applied to a layer's inputs either before or after the activation function of the preceding layer. It is typically placed after the activation function for s-shaped functions such as tanh and the sigmoid, and before the activation function for activations that can produce non-Gaussian distributions, such as the rectified linear unit \cite{machinelearningmastery_2019}. For further experiments I would therefore train the VGG38 model with the batch normalisation layer after the activation for both types of activation function (e.g. tanh and Leaky ReLU), and with the batch normalisation layer before an s-shaped activation function (e.g. tanh).
Secondly, each residual connection conventionally skips over a small, fixed number of layers (typically a single block of two), but stopping there would tell us little about possible extensions of the technique. For further experiments I would therefore investigate how combinations of residual connections with varying skip sizes affect the performance of the network. I would also test this within a ResNet architecture (as described in \cite{he2016deep}), since that architecture achieved a significant performance increase over the VGG architecture owing to its greater suitability for learning residual mappings.
Next, I would try a modified optimisation algorithm such as RMSProp or gradient clipping. Both extend the gradient descent optimisation algorithm and are designed to prevent large, noisy updates: RMSProp scales each update by a running average of recent squared gradients, while gradient clipping bounds the magnitude of the gradient before the update is applied.
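For example, norm-based gradient clipping rescales the gradient vector $g$ whenever its norm exceeds a chosen threshold $\tau$,
\[
g \leftarrow g \cdot \min\!\left(1, \frac{\tau}{\| g \|}\right),
\]
which leaves the update direction unchanged while bounding its magnitude.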
Fourthly, I would try a modified hidden-unit transfer function such as the Long Short-Term Memory (LSTM) unit. LSTM networks are a special kind of recurrent neural network, capable of learning long-term dependencies, and are explicitly designed to avoid the long-term dependency problem (where the gap between the relevant information and the point at which it is needed becomes very large in an RNN); they would therefore be an interesting option to test, as they may also allow the network to generalise better by exploiting these long-term dependencies \cite{lstm_2015}.
Lastly, I would try more careful weight initialisation. The main issue with naive random initialisation is that activations can start out saturated; when SGD then updates the weights in an attempt to influence a saturated activation's output, each update changes that output only very slightly, so training barely moves in the right direction. Xavier Glorot and Yoshua Bengio proposed a way to alleviate this problem. For the signal to flow properly, they argue, the variance of each layer's outputs should equal the variance of its inputs, and the gradients should have equal variance before and after flowing through the layer in the reverse direction. Both conditions cannot hold exactly unless the number of inputs to the layer ($fan_{in}$) equals the number of neurons in the layer ($fan_{out}$), so they proposed a well-proven compromise that works very well in practice: randomly initialise each layer's connection weights with a variance based on $fan_{avg} = (fan_{in} + fan_{out})/2$. A variance of $1/fan_{avg}$ gives the Xavier initialisation (typically used with the tanh, logistic, and softmax activation functions), $2/fan_{in}$ gives the He initialisation (typically used with ReLU and its variants), and $1/fan_{in}$ gives the LeCun initialisation (used with the SELU activation function) \cite{bohra_2021}.
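For instance, the uniform variant of the Xavier initialisation, which realises the $1/fan_{avg}$ variance above, draws each weight as
\[
W \sim \mathcal{U}\!\left[-\sqrt{\frac{3}{fan_{avg}}},\; \sqrt{\frac{3}{fan_{avg}}}\right],
\]
since a uniform distribution on $[-r, r]$ has variance $r^2/3$.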
}
}
%% - - - - - - - - - - - - FIGURES - - - - - - - - - - - -
%% Question Figure 3:
\newcommand{\questionFigureThree} {
\youranswer{%Question Figure 3 - Replace this image with a figure depicting the average gradient across layers, for the VGG38 model.
%
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{figures/vgg38-grad-flow.pdf}
\caption{Gradient Flow for VGG38}
\label{fig:grad_flow_38}
\end{figure}
}
}
%% Question Figure 4:
\newcommand{\questionFigureFour} {
\youranswer{%Question Figure 4 - Replace this image with a figure depicting the training curves for the model with the best performance across experiments you have available. (Also edit the caption accordingly).
%
\begin{figure}
\centering
\begin{subfigure}[b]{\linewidth}
\centering
\includegraphics[width=\textwidth]{figures/best_model_accuracy_performance.pdf}
\caption{Accuracy by epoch}
\end{subfigure}
%
\begin{subfigure}[b]{\linewidth}
\centering
\includegraphics[width=\textwidth]{figures/best_model_loss_performance.pdf}
\caption{Loss by epoch}
\end{subfigure}
\caption{Training and validation curves in terms of classification
accuracy (a) and loss (b) on the CIFAR100 dataset
for the best performing model: VGG38 BN+RC}
\label{fig:training_curves_bestModel}
\end{figure}
}
}
%% Question Figure 5:
\newcommand{\questionFigureFive} {
\youranswer{%Question Figure 5 - Replace this image with a figure depicting the average gradient across layers, for the model with the best performance across experiments you have available. (Also edit the caption accordingly).
%
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{figures/best_model_gradient_flow.pdf}
\caption{Gradient flow graphs for the best performing model: VGG38 BN+RC}
\label{fig:grad_flow_bestModel}
\end{figure}
}
}
%% - - - - - - - - - - - - TABLES - - - - - - - - - - - -
%% Question Table 1:
\newcommand{\questionTableOne} {
\youranswer{
%Question Table 1 - Fill in Table 1 with the results from your experiments on
%\begin{enumerate}
% \item \textit{VGG38 BN (LR 1e-3)}, and
% \item \textit{VGG38 BN + RC (LR 1e-2)}.
%\end{enumerate}
%
\begin{table*}[t]
\centering
\begin{tabular}{lr|ccccc}
\toprule
Model & LR & \# Params & Train loss & Train acc & Val loss & Val acc \\
\midrule
VGG08 & 1e-3 & 60 K & 1.74 & 51.59 & 1.95 & 46.84 \\
VGG38 & 1e-3 & 336 K & 4.61 & 00.01 & 4.61 & 00.01 \\
VGG38 BN & 1e-3 & 339 K & 1.47 & 57.57 & 1.98 & 47.52 \\
VGG38 RC & 1e-3 & 336 K & 1.33 & 61.52 & 1.84 & 52.32 \\
VGG38 BN + RC & 1e-3 & 339 K & 1.26 & 62.99 & 1.73 & 53.76 \\
VGG38 BN & 1e-2 & 339 K & 1.70 & 52.28 & 1.99 & 46.72 \\
VGG38 BN + RC & 1e-2 & 339 K & 0.57 & 82.02 & 1.79 & 61.8 \\
\bottomrule
\end{tabular}
\caption{Experiment results (number of model parameters, Training and Validation loss and accuracy) for different combinations of VGG08, VGG38, Batch Normalisation (BN), and Residual Connections (RC), LR is learning rate.}
\label{tab:CIFAR_results}
\end{table*}
}
}
%% END of YOUR ANSWERS