
Added concept of Linear Unit, Layers, Stacking Dense Layers, and `Dropout and Batch Normalization` to `Neural Networks`

Signed-off-by: Ayush Joshi <ayush854032@gmail.com>
joshiayush committed Nov 26, 2023
1 parent 429f1f1 commit 64337d7
Showing 4 changed files with 346 additions and 11 deletions.
4 changes: 2 additions & 2 deletions docs/ml/Classification.md
@@ -289,7 +289,7 @@ An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering

<div align="center">

<img src="https://developers.google.com/static/machine-learning/crash-course/images/ROCCurve.svg" width="400" height="400" />
<img src="https://developers.google.com/static/machine-learning/crash-course/images/ROCCurve.svg" />

<strong>Figure 4. TP vs. FP rate at different classification thresholds.</strong>

@@ -303,7 +303,7 @@ To compute the points in an ROC curve, we could evaluate a logistic regression m

<div align="center">

<img src="https://developers.google.com/static/machine-learning/crash-course/images/AUC.svg" width="400" height="400" />
<img src="https://developers.google.com/static/machine-learning/crash-course/images/AUC.svg" />

<strong>Figure 5. AUC (Area under the ROC Curve).</strong>

90 changes: 89 additions & 1 deletion docs/ml/Neural-Networks.md
@@ -40,7 +40,51 @@ To see how neural networks might help with nonlinear problems, let's start by re

Each blue circle represents an input feature, and the green circle represents the weighted sum of the inputs.

How can we alter this model to improve its ability to deal with nonlinear problems?
### The Linear Unit

Let's begin with the fundamental component of a neural network: the individual neuron. As a diagram, a **neuron** (or **unit**) with one input looks like this:

<div align='center'>

<img src="https://storage.googleapis.com/kaggle-media/learn/images/mfOlDR6.png" />

<strong><i>The Linear Unit:</i> y = wx + b</strong>

</div>

The input is $x$. Its connection to the neuron has a **weight**, $w$. Whenever a value flows through a connection, you multiply the value by the connection's weight. For the input $x$, what reaches the neuron is $w * x$. A neural network "learns" by modifying its weights.

The $b$ is a special kind of weight called the **bias**. The bias doesn't have any input data associated with it; instead, we put a $1$ in the diagram so that the value that reaches the neuron is just $b$ (since $1 * b = b$). The bias enables the neuron to modify the output independently of its inputs.

The $y$ is the value the neuron ultimately outputs. To get the output, the neuron sums up all the values it receives through its connections. This neuron's activation is $y = w * x + b$, or as a formula $y=wx+b$.
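
To make this concrete, here is a minimal sketch of a one-input linear unit in plain Python (the weight and bias values below are made up for illustration; in a real network they would be learned):

```python
# A single linear unit: y = w * x + b
w = 2.0  # weight (illustrative value; learned during training in practice)
b = 0.5  # bias (also learned)

def linear_unit(x):
    """Output of a one-input linear unit."""
    return w * x + b

print(linear_unit(3.0))  # 2.0 * 3.0 + 0.5 = 6.5
```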

### Multiple Inputs

In the previous section we saw how we can handle a single input using *The Linear Unit*, but what if we wanted to expand our model to include more inputs? That's easy enough. We can just add more input connections to the neuron, one for each additional feature. To find the output, we would multiply each input by its connection weight and then add them all together.

<div align='center'>

<img src="https://storage.googleapis.com/kaggle-media/learn/images/vyXSnlZ.png" />

<strong>A linear unit with three inputs.</strong>

</div>

The formula for this neuron would be $y = w_0 x_0 + w_1 x_1 + w_2 x_2 + b$. A linear unit with two inputs will fit a plane, and a unit with more inputs than that will fit a hyperplane.
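
As a quick sketch of the three-input case (the weights, bias, and inputs below are arbitrary illustrative values):

```python
import numpy as np

# A linear unit with three inputs: y = w0*x0 + w1*x1 + w2*x2 + b
w = np.array([0.2, -1.3, 0.7])  # one weight per input
b = 0.1                         # bias
x = np.array([5.0, 2.0, 1.0])   # one example with three features

y = np.dot(w, x) + b            # weighted sum of the inputs, plus the bias
print(y)                        # 1.0 - 2.6 + 0.7 + 0.1 = -0.8
```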

### Layers

Neural networks typically organize their neurons into **layers**. When we collect together linear units having a common set of inputs we get a **dense** layer.

<div align='center'>

<img src="https://storage.googleapis.com/kaggle-media/learn/images/2MA4iMV.png" />

<strong>A dense layer of two linear units receiving two inputs and a bias.</strong>

</div>
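
In Keras (the TensorFlow API referenced later in this doc), the dense layer pictured above could be declared roughly like this (a sketch; the layer and input sizes simply match the figure):

```python
from tensorflow import keras
from tensorflow.keras import layers

# A dense layer of two linear units receiving two inputs (plus a bias per unit).
model = keras.Sequential([
    keras.Input(shape=(2,)),  # two input features
    layers.Dense(units=2),    # two linear units, no activation yet
])

model.summary()  # 2 inputs * 2 units + 2 biases = 6 trainable parameters
```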

You could think of each layer in a neural network as performing some kind of relatively simple transformation. Through a deep stack of layers, a neural network can transform its inputs in more and more complex ways. In a well-trained neural network, each layer is a transformation getting us a little bit closer to a solution.

### Hidden Layers

@@ -69,6 +113,14 @@ Is this model still linear? Yes, it is. When you express the output as a functio

### Activation Functions

<div align='center'>

<img src="https://storage.googleapis.com/kaggle-media/learn/images/OLSUEYT.png" />

<i>Without activation functions, neural networks can only learn linear relationships. In order to fit curves, we'll need to use activation functions.</i>

</div>

To model a nonlinear problem, we can directly introduce a nonlinearity. We can pipe each hidden layer node through a nonlinear function.

In the model represented by the following graph, the value of each node in Hidden Layer 1 is transformed by a nonlinear function before being passed on to the weighted sums of the next layer. This nonlinear function is called the activation function.
@@ -81,6 +133,26 @@ In the model represented by the following graph, the value of each node in Hidde

</div>

An **activation function** is simply some function we apply to each of a layer's outputs (its activations). The most common is the rectifier function $max(0,x)$.

<div align='center'>

<img src="https://storage.googleapis.com/kaggle-media/learn/images/aeIyAlF.png" />

</div>

The rectifier function has a graph that's a line with the negative part "rectified" to zero. Applying the function to the outputs of a neuron will put a bend in the data, moving us away from simple lines.

When we attach the rectifier to a linear unit, we get a **rectified linear unit** or **ReLU**. (For this reason, it's common to call the rectifier function the "ReLU function".) Applying a ReLU activation to a linear unit means the output becomes $max(0, w * x + b)$, which we might draw in a diagram like:

<div align='center'>

<img src="https://storage.googleapis.com/kaggle-media/learn/images/eFry7Yu.png" />

<i>A rectified linear unit.</i>

</div>
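
A minimal sketch of a single rectified linear unit in plain Python (the weight and bias values are illustrative):

```python
def relu(z):
    """The rectifier: keep positive values, clamp negatives to zero."""
    return max(0.0, z)

w, b = 1.5, -1.0  # illustrative weight and bias

def relu_unit(x):
    """A rectified linear unit: the rectifier applied to a linear unit's output."""
    return relu(w * x + b)

print(relu_unit(2.0))   # max(0, 1.5 * 2.0 - 1.0) = 2.0
print(relu_unit(-2.0))  # max(0, 1.5 * (-2.0) - 1.0) = 0.0
```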

Now that we've added an activation function, adding layers has more impact. Stacking nonlinearities on nonlinearities lets us model very complicated relationships between the inputs and the predicted outputs. In brief, each layer is effectively learning a more complex, higher-level function over the raw inputs. If you'd like to develop more intuition on how this works, see [Chris Olah's excellent blog post](http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/).

### Common Activation Functions
@@ -119,6 +191,22 @@ $$\sigma(w \cdot x + b)$$

TensorFlow provides out-of-the-box support for many activation functions. You can find these activation functions within TensorFlow's [list of wrappers for primitive neural network operations](https://www.tensorflow.org/api_docs/python/tf/nn). That said, we still recommend starting with ReLU.
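
For instance, a few of the activations available both under `tf.nn` and as Keras activation strings (a sketch; the input values are arbitrary):

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.constant([-2.0, -0.5, 0.0, 0.5, 2.0])

print(tf.nn.relu(x))     # rectifier: negatives clamped to zero
print(tf.nn.sigmoid(x))  # squashes values into (0, 1)
print(tf.nn.tanh(x))     # squashes values into (-1, 1)

# The same activations can be attached to a layer by name.
hidden = layers.Dense(units=4, activation='relu')
```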

### Stacking Dense Layers

Now that we have some nonlinearity, let's see how we can stack layers to get complex data transformations.

<div align='center'>

<img src="https://storage.googleapis.com/kaggle-media/learn/images/Y5iwFQZ.png" />

<i>A stack of dense layers makes a "fully-connected" network.</i>

</div>

The layers before the output layer are sometimes called **hidden** since we never see their outputs directly.

Now, notice that the final (output) layer is a linear unit (meaning, no activation function). That makes this network appropriate for a regression task, where we are trying to predict some arbitrary numeric value. Other tasks (like classification) might require an activation function on the output.
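
Putting the pieces together, a stack like the one pictured might be written as follows (a sketch; the layer widths are illustrative, and the single linear output unit assumes a regression-style task):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(2,)),                   # two input features
    layers.Dense(units=4, activation='relu'),  # hidden layer 1
    layers.Dense(units=3, activation='relu'),  # hidden layer 2
    layers.Dense(units=1),                     # linear output unit (no activation)
])
```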

### Summary

Now our model has all the standard components of what people usually mean when they say "neural network":
18 changes: 17 additions & 1 deletion docs/ml/Training-Neural-Networks.md
@@ -36,4 +36,20 @@ Yet another form of regularization, called **Dropout**, is useful for neural net

* 0.0 = No dropout regularization.
* 1.0 = Drop out everything. The model learns nothing.
* Values between 0.0 and 1.0 = More useful.
* Values between 0.0 and 1.0 = More useful.

<div align="center">

<img src="https://storage.googleapis.com/kaggle-media/learn/images/a86utxY.gif" />

<i>Here, 50% dropout has been added between the two hidden layers.</i>

</div>
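
In Keras, that might look like the following sketch (layer sizes and the feature count are illustrative; `rate=0.5` matches the 50% dropout in the figure):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(11,)),              # illustrative number of input features
    layers.Dense(128, activation='relu'),  # hidden layer 1
    layers.Dropout(rate=0.5),              # drop 50% of the previous layer's outputs during training
    layers.Dense(64, activation='relu'),   # hidden layer 2
    layers.Dense(1),                       # linear output
])
```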

### Batch Normalization (batchnorm)

With neural networks, it's generally a good idea to put all of your data on a common scale. The reason is that SGD will shift the network weights in proportion to how large an activation the data produces. Features that tend to produce activations of very different sizes can make for unstable training behavior.

Now, if it's good to normalize the data before it goes into the network, maybe also normalizing inside the network would be better! In fact, we have a special kind of layer that can do this, the **batch normalization layer**. A batch normalization layer looks at each batch as it comes in, first normalizing the batch with its own mean and standard deviation, and then also putting the data on a new scale with two trainable rescaling parameters. Batchnorm, in effect, performs a kind of coordinated rescaling of its inputs.

Most often, batchnorm is added as an aid to the optimization process (though it can sometimes also help prediction performance). Models with batchnorm tend to need fewer epochs to complete training. Moreover, batchnorm can also fix various problems that can cause the training to get "stuck". Consider adding batch normalization to your models, especially if you're having trouble during training.
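
A sketch of where batch normalization layers might sit in a Keras model (the placement after each hidden layer and the layer sizes are illustrative; batchnorm can also be used as the first layer, acting as a kind of adaptive preprocessor):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(11,)),               # illustrative number of input features
    layers.Dense(128, activation='relu'),
    layers.BatchNormalization(),            # renormalize this layer's outputs batch by batch
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dense(1),
])
```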

