
class: middle, center, title-slide

Deep Learning

Lecture 2: Multi-layer perceptron



Prof. Gilles Louppe
g.louppe@uliege.be


Today

Explain and motivate the basic constructs of neural networks.

  • From linear discriminant analysis to logistic regression
  • Stochastic gradient descent
  • From logistic regression to the multi-layer perceptron
  • Vanishing gradients and rectified networks
  • Universal approximation theorem

class: middle

Neural networks


Threshold Logic Unit

.grid[ .kol-3-5[ The Threshold Logic Unit (McCulloch and Pitts, 1943) $$f(\mathbf{x}) = 1_{\{\sum_i w_i x_i + b \geq 0\}},$$ with Boolean inputs $x_i$, weights $w_i$ and bias $b$, was the first mathematical model for a neuron.

This unit can implement

  • $\text{or}(a,b) = 1_{\{a+b - 0.5 \geq 0\}}$,
  • $\text{and}(a,b) = 1_{\{a+b - 1.5 \geq 0\}}$,
  • $\text{not}(a) = 1_{\{-a + 0.5 \geq 0\}}$.

Therefore, any Boolean function can be built with such units. ] .kol-2-5.width-100[] ]

.footnote[Credits: McCulloch and Pitts, A logical calculus of ideas immanent in nervous activity, 1943.]
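As a sketch of the idea (the helper names `tlu`, `OR`, `AND` and `NOT` are ours, not from the original lecture), the unit and the three gates above can be written in a few lines of NumPy:

```python
import numpy as np

def tlu(x, w, b):
    """Threshold logic unit: outputs 1 iff the weighted sum plus bias is non-negative."""
    return int(np.dot(w, x) + b >= 0)

# The Boolean gates above, with the weights and biases from the slide.
OR  = lambda a, b: tlu(np.array([a, b]), np.array([1.0, 1.0]), -0.5)
AND = lambda a, b: tlu(np.array([a, b]), np.array([1.0, 1.0]), -1.5)
NOT = lambda a:    tlu(np.array([a]),    np.array([-1.0]),      0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "or:", OR(a, b), "and:", AND(a, b))
```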


Perceptron

The perceptron (Rosenblatt, 1957) $$f(\mathbf{x}) = \begin{cases} 1 &\text{if } \sum_i w_i x_i + b \geq 0 \\ 0 &\text{otherwise} \end{cases}$$ is very similar, except that its inputs are real-valued.

This model was originally motivated by biology, with $w_i$ being synaptic weights and $x_i$ and $f$ firing rates. .center.width-65[]

.footnote[Credits: Frank Rosenblatt, Mark I Perceptron operators' manual, 1960.]

???

A perceptron is a signal transmission network consisting of sensory units (S units), association units (A units), and output or response units (R units). The ‘retina’ of the perceptron is an array of sensory elements (photocells). An S-unit produces a binary output depending on whether or not it is excited. A randomly selected set of retinal cells is connected to the next level of the network, the A units. As originally proposed there were extensive connections among the A units, the R units, and feedback between the R units and the A units.

In essence an association unit is also an MCP neuron which is 1 if a single specific pattern of inputs is received, and it is 0 for all other possible patterns of inputs. Each association unit will have a certain number of inputs which are selected from all the inputs to the perceptron. So the number of inputs to a particular association unit does not have to be the same as the total number of inputs to the perceptron, but clearly the number of inputs to an association unit must be less than or equal to the total number of inputs to the perceptron. Each association unit's output then becomes the input to a single MCP neuron, and the output from this single MCP neuron is the output of the perceptron. So a perceptron consists of a "layer" of MCP neurons, and all of these neurons send their output to a single MCP neuron.


class: middle, center, black-slide

.grid[ .kol-1-2[.width-100[]] .kol-1-2[

.width-100[]] ]

The Mark I Perceptron (Frank Rosenblatt).


class: middle, center, black-slide

<iframe width="600" height="450" src="https://www.youtube.com/embed/cNxadbrN_aI" frameborder="0" allowfullscreen></iframe>

The Perceptron


class: middle

Let us define the (non-linear) activation function:

$$\text{sign}(x) = \begin{cases} 1 &\text{if } x \geq 0 \\ 0 &\text{otherwise} \end{cases}$$ .center[]

The perceptron classification rule can be rewritten as $$f(\mathbf{x}) = \text{sign}(\sum_i w_i x_i + b).$$


class: middle

Computational graphs

.grid[ .kol-3-5[.width-90[]] .kol-2-5[ The computation of $$f(\mathbf{x}) = \text{sign}(\sum_i w_i x_i + b)$$ can be represented as a computational graph where

  • white nodes correspond to inputs and outputs;
  • red nodes correspond to model parameters;
  • blue nodes correspond to intermediate operations. ] ]

???

Draw the NN diagram.


class: middle

In terms of tensor operations, $f$ can be rewritten as $$f(\mathbf{x}) = \text{sign}(\mathbf{w}^T \mathbf{x} + b),$$ for which the corresponding computational graph of $f$ is:

.center.width-70[]

???

Ask about the intuitive meaning of $f(x)$ (i.e., the product as a similarity measure).


Linear discriminant analysis

Consider training data $(\mathbf{x}, y) \sim p_{X,Y}$, with

  • $\mathbf{x} \in \mathbb{R}^p$,
  • $y \in \{0,1\}$.

Assume class populations are Gaussian, with same covariance matrix $\Sigma$ (homoscedasticity):

$$p(\mathbf{x}|y) = \frac{1}{\sqrt{(2\pi)^p |\Sigma|}} \exp \left(-\frac{1}{2}(\mathbf{x} - \mathbf{\mu}_y)^T \Sigma^{-1}(\mathbf{x} - \mathbf{\mu}_y) \right)$$

???

Switch to blackboard.



Using Bayes' rule, we have:

$$\begin{aligned} p(y=1|\mathbf{x}) &= \frac{p(\mathbf{x}|y=1) p(y=1)}{p(\mathbf{x})} \\\ &= \frac{p(\mathbf{x}|y=1) p(y=1)}{p(\mathbf{x}|y=0)p(y=0) + p(\mathbf{x}|y=1)p(y=1)} \\\ &= \frac{1}{1 + \frac{p(\mathbf{x}|y=0)p(y=0)}{p(\mathbf{x}|y=1)p(y=1)}}. \end{aligned}$$

--

count: false

It follows that with

$$\sigma(x) = \frac{1}{1 + \exp(-x)},$$

we get

$$p(y=1|\mathbf{x}) = \sigma\left(\log \frac{p(\mathbf{x}|y=1)}{p(\mathbf{x}|y=0)} + \log \frac{p(y=1)}{p(y=0)}\right).$$


class: middle

Therefore,

$$\begin{aligned} &p(y=1|\mathbf{x}) \\\ &= \sigma\left(\log \frac{p(\mathbf{x}|y=1)}{p(\mathbf{x}|y=0)} + \underbrace{\log \frac{p(y=1)}{p(y=0)}}_{a}\right) \\\ &= \sigma\left(\log p(\mathbf{x}|y=1) - \log p(\mathbf{x}|y=0) + a\right) \\\ &= \sigma\left(-\frac{1}{2}(\mathbf{x} - \mathbf{\mu}_1)^T \Sigma^{-1}(\mathbf{x} - \mathbf{\mu}_1) + \frac{1}{2}(\mathbf{x} - \mathbf{\mu}_0)^T \Sigma^{-1}(\mathbf{x} - \mathbf{\mu}_0) + a\right) \\\ &= \sigma\left(\underbrace{(\mu_1-\mu_0)^T \Sigma^{-1}}_{\mathbf{w}^T}\mathbf{x} + \underbrace{\frac{1}{2}(\mu_0^T \Sigma^{-1} \mu_0 - \mu_1^T \Sigma^{-1} \mu_1) + a}_{b} \right) \\\ &= \sigma\left(\mathbf{w}^T \mathbf{x} + b\right) \end{aligned}$$
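The closed-form solution above maps directly to code. Here is a minimal sketch, under the slide's assumptions (Gaussian classes, shared covariance), that estimates $\mathbf{w}$ and $b$ from samples of each class; the function name `lda_parameters` and the use of class counts for the prior are our choices:

```python
import numpy as np

def lda_parameters(X0, X1):
    """Compute (w, b) from the LDA derivation, given samples of each class.

    X0, X1: arrays of shape (n0, p) and (n1, p).
    Assumes homoscedastic Gaussian classes, as in the derivation.
    """
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    n0, n1 = len(X0), len(X1)
    # Pooled covariance estimate (homoscedasticity assumption).
    Sigma = (np.cov(X0, rowvar=False) * (n0 - 1) +
             np.cov(X1, rowvar=False) * (n1 - 1)) / (n0 + n1 - 2)
    Sigma_inv = np.linalg.inv(Sigma)
    a = np.log(n1 / n0)  # log prior ratio, estimated from class counts
    w = Sigma_inv @ (mu1 - mu0)
    b = 0.5 * (mu0 @ Sigma_inv @ mu0 - mu1 @ Sigma_inv @ mu1) + a
    return w, b

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# p(y=1|x) is then sigmoid(w @ x + b), as derived above.
```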


class: middle, center

.width-100[]


count: false class: middle, center

.width-100[]


count: false class: middle, center

.width-100[]


class: middle

Note that the sigmoid function $$\sigma(x) = \frac{1}{1 + \exp(-x)}$$ looks like a soft Heaviside step function:

.center[]

Therefore, the overall model $f(\mathbf{x};\mathbf{w},b) = \sigma(\mathbf{w}^T \mathbf{x} + b)$ is very similar to the perceptron.


class: middle, center

.center.width-70[]

This unit is the main primitive of all neural networks!


Logistic regression

Same model $$p(y=1|\mathbf{x}) = \sigma\left(\mathbf{w}^T \mathbf{x} + b\right)$$ as for linear discriminant analysis.

But,

  • ignore model assumptions (Gaussian class populations, homoscedasticity);
  • instead, find $\mathbf{w}, b$ that maximizes the likelihood of the data.

???

Switch to blackboard.


class: middle

We have,

$$\begin{aligned} &\arg \max_{\mathbf{w},b} p(\mathbf{d}|\mathbf{w},b) \\\ &= \arg \max_{\mathbf{w},b} \prod_{\mathbf{x}_i, y_i \in \mathbf{d}} p(y=y_i|\mathbf{x}_i, \mathbf{w},b) \\\ &= \arg \max_{\mathbf{w},b} \prod_{\mathbf{x}_i, y_i \in \mathbf{d}} \sigma(\mathbf{w}^T \mathbf{x}_i + b)^{y_i} (1-\sigma(\mathbf{w}^T \mathbf{x}_i + b))^{1-y_i} \\\ &= \arg \min_{\mathbf{w},b} \underbrace{\sum_{\mathbf{x}_i, y_i \in \mathbf{d}} -{y_i} \log\sigma(\mathbf{w}^T \mathbf{x}_i + b) - {(1-y_i)} \log (1-\sigma(\mathbf{w}^T \mathbf{x}_i + b))}_{\mathcal{L}(\mathbf{w}, b) = \sum_i \ell(y_i, \hat{y}(\mathbf{x}_i; \mathbf{w}, b))} \end{aligned}$$

This loss is an instance of the cross-entropy $$H(p,q) = \mathbb{E}_p[-\log q]$$ for $p=p_{Y|\mathbf{x}_i}$ and $q=p_{\hat{Y}|\mathbf{x}_i}$.
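As a small sketch of this loss and its closed-form gradient (the `eps` clipping is our addition, to avoid $\log 0$ numerically):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_entropy_loss(w, b, X, y):
    """L(w, b) from the slide, summed over the dataset; y has values in {0, 1}."""
    p = sigmoid(X @ w + b)
    eps = 1e-12  # numerical guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def gradients(w, b, X, y):
    """Gradients of L; for the sigmoid, dL/dz_i simplifies to p_i - y_i."""
    p = sigmoid(X @ w + b)
    return X.T @ (p - y), np.sum(p - y)
```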


class: middle

When $Y$ takes values in $\{-1,1\}$, a similar derivation yields the logistic loss $$\mathcal{L}(\mathbf{w}, b) = -\sum_{\mathbf{x}_i, y_i \in \mathbf{d}} \log \sigma\left(y_i (\mathbf{w}^T \mathbf{x}_i + b)\right).$$

.center[]


Multi-layer perceptron

So far we considered the logistic unit $h=\sigma\left(\mathbf{w}^T \mathbf{x} + b\right)$, where $h \in \mathbb{R}$, $\mathbf{x} \in \mathbb{R}^p$, $\mathbf{w} \in \mathbb{R}^p$ and $b \in \mathbb{R}$.

These units can be composed in parallel to form a layer with $q$ outputs: $$\mathbf{h} = \sigma(\mathbf{W}^T \mathbf{x} + \mathbf{b})$$ where $\mathbf{h} \in \mathbb{R}^q$, $\mathbf{x} \in \mathbb{R}^p$, $\mathbf{W} \in \mathbb{R}^{p\times q}$, $\mathbf{b} \in \mathbb{R}^q$ and where $\sigma(\cdot)$ is upgraded to the element-wise sigmoid function.


.center.width-70[![](figures/lec2/graphs/layer.svg)]

???

Draw the NN diagram.


class: middle

Similarly, layers can be composed in series, such that: $$\begin{aligned} \mathbf{h}_0 &= \mathbf{x} \\ \mathbf{h}_1 &= \sigma(\mathbf{W}_1^T \mathbf{h}_0 + \mathbf{b}_1) \\ ... \\ \mathbf{h}_L &= \sigma(\mathbf{W}_L^T \mathbf{h}_{L-1} + \mathbf{b}_L) \\ f(\mathbf{x}; \theta) = \hat{y} &= \mathbf{h}_L \end{aligned}$$ where $\theta$ denotes the model parameters $\{ \mathbf{W}_k, \mathbf{b}_k, ... | k=1, ..., L\}$.

This model is the multi-layer perceptron, also known as the fully connected feedforward network.
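As a minimal sketch of this composition (the layer sizes 3-10-10-1 are an arbitrary example):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, params):
    """Forward pass h_k = sigma(W_k^T h_{k-1} + b_k), as on the slide.

    params: list of (W_k, b_k) pairs, with W_k of shape (p_{k-1}, p_k).
    """
    h = x
    for W, b in params:
        h = sigmoid(W.T @ h + b)
    return h

# Example shapes: a 3-10-10-1 network evaluated on a single input x in R^3.
rng = np.random.default_rng(0)
params = [(rng.normal(size=(3, 10)), np.zeros(10)),
          (rng.normal(size=(10, 10)), np.zeros(10)),
          (rng.normal(size=(10, 1)), np.zeros(1))]
y_hat = mlp_forward(rng.normal(size=3), params)
```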

???

Draw the NN diagram.


class: middle, center

.width-100[]


class: middle

Output layer

  • For binary classification, the width $q$ of the last layer $L$ is set to $1$, which results in a single output $h_L \in [0,1]$ that models the probability $p(y=1|\mathbf{x})$.
  • For multi-class classification, the sigmoid activation $\sigma$ in the last layer can be generalized to produce a vector $\mathbf{h}_L \in \bigtriangleup^C$ of probability estimates $p(y=i|\mathbf{x})$.

    This activation is the $\text{Softmax}$ function, where its $i$-th output is defined as $$\text{Softmax}(\mathbf{z})_i = \frac{\exp(z_i)}{\sum_{j=1}^C \exp(z_j)},$$ for $i=1, ..., C$.
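A short sketch of this function; the shift by $\max_i z_i$ is a standard numerical-stability trick that leaves the output unchanged (it cancels in the ratio):

```python
import numpy as np

def softmax(z):
    """Softmax(z)_i = exp(z_i) / sum_j exp(z_j), computed stably."""
    z = z - np.max(z)  # invariant shift, prevents overflow in exp
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))  # entries in (0, 1), summing to 1
```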

Regression

For regression problems, one usually starts with the assumption that $$p(y|\mathbf{x}) = \mathcal{N}(y; \mu=f(\mathbf{x}; \theta), \sigma^2=1),$$ where $f$ is parameterized with a neural network whose last layer does not contain any final activation.


class: middle

We have, $$\begin{aligned} &\arg \max_{\theta} p(\mathbf{d}|\theta) \\ &= \arg \max_{\theta} \prod_{\mathbf{x}_i, y_i \in \mathbf{d}} p(y=y_i|\mathbf{x}_i, \theta) \\ &= \arg \min_{\theta} -\sum_{\mathbf{x}_i, y_i \in \mathbf{d}} \log p(y=y_i|\mathbf{x}_i, \theta) \\ &= \arg \min_{\theta} -\sum_{\mathbf{x}_i, y_i \in \mathbf{d}} \log\left( \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(y_i - f(\mathbf{x}_i;\theta))^2\right) \right)\\ &= \arg \min_{\theta} \sum_{\mathbf{x}_i, y_i \in \mathbf{d}} (y_i - f(\mathbf{x}_i;\theta))^2, \end{aligned}$$ which recovers the common squared error loss $\ell(y, \hat{y}) = (y-\hat{y})^2$.


class: middle, center

(demo)


Training neural networks

  • In general, the loss functions do not admit a minimizer that can be expressed analytically in closed form.
  • However, a minimizer can be found numerically, using a general minimization technique such as gradient descent.

class: middle

Gradient descent

Let $\mathcal{L}(\theta)$ denote a loss function defined over model parameters $\theta$ (e.g., $\mathbf{w}$ and $b$).

To minimize $\mathcal{L}(\theta)$, gradient descent uses local linear information to iteratively move towards a (local) minimum.

For $\theta_0 \in \mathbb{R}^d$, a first-order approximation around $\theta_0$ can be defined as $$\hat{\mathcal{L}}(\epsilon; \theta_0) = \mathcal{L}(\theta_0) + \epsilon^T\nabla_\theta \mathcal{L}(\theta_0) + \frac{1}{2\gamma}||\epsilon||^2.$$

.center.width-60[]

???

Switch to blackboard.


class: middle

A minimizer of the approximation $\hat{\mathcal{L}}(\epsilon; \theta_0)$ is given for $$\begin{aligned} \nabla_\epsilon \hat{\mathcal{L}}(\epsilon; \theta_0) &= 0 \\ &= \nabla_\theta \mathcal{L}(\theta_0) + \frac{1}{\gamma} \epsilon, \end{aligned}$$ which results in the best improvement for the step $\epsilon = -\gamma \nabla_\theta \mathcal{L}(\theta_0)$.

Therefore, model parameters can be updated iteratively using the update rule $$\theta_{t+1} = \theta_t -\gamma \nabla_\theta \mathcal{L}(\theta_t),$$ where

  • $\theta_0$ are the initial parameters of the model;
  • $\gamma$ is the learning rate;
  • both are critical for the convergence of the update rule.
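The update rule fits in a few lines. A minimal sketch, where `grad_L` stands for the gradient of any differentiable loss (the quadratic example is ours):

```python
import numpy as np

def gradient_descent(grad_L, theta0, gamma=0.1, steps=100):
    """Iterate theta_{t+1} = theta_t - gamma * grad_L(theta_t)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - gamma * grad_L(theta)
    return theta

# Example: minimize L(theta) = ||theta||^2 / 2, whose gradient is theta itself.
theta_star = gradient_descent(lambda th: th, theta0=[3.0, -2.0])
```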

class: center, middle

Example 1: Convergence to a local minimum



class: center, middle

Example 2: Convergence to the global minimum



class: center, middle

Example 3: Divergence due to an overly large learning rate



Stochastic gradient descent

In the empirical risk minimization setup, $\mathcal{L}(\theta)$ and its gradient decompose as $$\begin{aligned} \mathcal{L}(\theta) &= \frac{1}{N} \sum_{\mathbf{x}_i, y_i \in \mathbf{d}} \ell(y_i, f(\mathbf{x}_i; \theta)) \\ \nabla \mathcal{L}(\theta) &= \frac{1}{N} \sum_{\mathbf{x}_i, y_i \in \mathbf{d}} \nabla \ell(y_i, f(\mathbf{x}_i; \theta)). \end{aligned}$$ Therefore, in batch gradient descent the complexity of an update grows linearly with the size $N$ of the dataset. This is bad!


class: middle

Since the empirical risk is already an approximation of the expected risk, it should not be necessary to carry out the minimization with great accuracy.




Instead, stochastic gradient descent uses as update rule: $$\theta_{t+1} = \theta_t - \gamma \nabla \ell(y_{i(t+1)}, f(\mathbf{x}_{i(t+1)}; \theta_t))$$

  • Iteration complexity is independent of $N$.
  • The stochastic process $\{ \theta_t | t=1, ... \}$ depends on the examples $i(t)$ picked randomly at each iteration.
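A minimal sketch of this update rule; the signature of `grad_ell(y_i, x_i, theta)` is a placeholder of our choosing for the per-example gradient:

```python
import numpy as np

def sgd(grad_ell, data, theta0, gamma=0.01, steps=10_000, seed=0):
    """theta_{t+1} = theta_t - gamma * grad ell(y_i, f(x_i; theta_t)),
    with the example i(t+1) drawn uniformly at random at each step."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        x_i, y_i = data[rng.integers(len(data))]  # random example i(t+1)
        theta = theta - gamma * grad_ell(y_i, x_i, theta)
    return theta
```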

--

.grid.center.italic[ .kol-1-2[.width-100[]

Batch gradient descent] .kol-1-2[.width-100[]

Stochastic gradient descent ] ]


class: middle

Why is stochastic gradient descent still a good idea?

  • Informally, averaging the update $$\theta_{t+1} = \theta_t - \gamma \nabla \ell(y_{i(t+1)}, f(\mathbf{x}_{i(t+1)}; \theta_t)) $$ over all choices $i(t+1)$ restores batch gradient descent.
  • Formally, if the gradient estimate is unbiased, e.g., if $$\begin{aligned} \mathbb{E}_{i(t+1)}[\nabla \ell(y_{i(t+1)}, f(\mathbf{x}_{i(t+1)}; \theta_t))] &= \frac{1}{N} \sum_{\mathbf{x}_i, y_i \in \mathbf{d}} \nabla \ell(y_i, f(\mathbf{x}_i; \theta_t)) \\ &= \nabla \mathcal{L}(\theta_t) \end{aligned}$$ then the formal convergence of SGD can be proved, under appropriate assumptions (see references).
  • If training is limited to a single pass over the data, then SGD directly minimizes the expected risk.

class: middle

The excess error characterizes the expected risk discrepancy between the Bayes model and the approximate empirical risk minimizer. It can be decomposed as $$\begin{aligned} &\mathbb{E}\left[ R(\tilde{f}_*^\mathbf{d}) - R(f_B) \right] \\ &= \mathbb{E}\left[ R(f_*) - R(f_B) \right] + \mathbb{E}\left[ R(f_*^\mathbf{d}) - R(f_*) \right] + \mathbb{E}\left[ R(\tilde{f}_*^\mathbf{d}) - R(f_*^\mathbf{d}) \right] \\ &= \mathcal{E}_\text{app} + \mathcal{E}_\text{est} + \mathcal{E}_\text{opt} \end{aligned}$$ where

  • $\mathcal{E}_\text{app}$ is the approximation error due to the choice of a hypothesis space,
  • $\mathcal{E}_\text{est}$ is the estimation error due to the empirical risk minimization principle,
  • $\mathcal{E}_\text{opt}$ is the optimization error due to the approximate optimization algorithm.

class: middle

A fundamental result due to Bottou and Bousquet (2011) states that stochastic optimization algorithms (e.g., SGD) yield the best generalization performance (in terms of excess error) despite being the worst optimization algorithms for minimizing the empirical risk.


Automatic differentiation (teaser)

To minimize $\mathcal{L}(\theta)$ with stochastic gradient descent, we need the gradient $$\nabla \mathcal{\ell}(\theta) = \begin{bmatrix} \frac{\partial \mathcal{\ell}}{\partial \theta_0}(\theta) \\ \\ \vdots \\ \\ \frac{\partial \mathcal{\ell}}{\partial \theta_{K-1}}(\theta) \end{bmatrix} $$ i.e., a vector that gathers the partial derivatives of the loss for each model parameter $\theta_k$ for $k=0, \ldots, K-1$.

These derivatives can be evaluated automatically from the computational graph of $\ell$ using automatic differentiation.


class: middle

In Leibniz notations, the chain rule states that $$ \begin{aligned} \frac{\partial \ell}{\partial \theta_i} &= \sum_{k \in \text{parents}(\ell)} \frac{\partial \ell}{\partial u_k} \underbrace{\frac{\partial u_k}{\partial \theta_i}}_{\text{recursive case}} \end{aligned}$$
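This recursion is exactly what a reverse-mode autodiff engine implements. A didactic sketch (not an efficient implementation; names are ours): each `Var` records its parents and the local partials $\frac{\partial u_k}{\partial \theta_i}$, and `backward` applies the chain rule above recursively.

```python
class Var:
    """Minimal scalar reverse-mode autodiff node."""
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __mul__(self, other):
        # Local partials: d(uv)/du = v, d(uv)/dv = u.
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def backward(self, upstream=1.0):
        # Accumulate the chain rule: dl/du_parent += dl/du_self * du_self/du_parent.
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)

# d/dw [w * x + b] at w=2, x=3, b=1:
w, x, b = Var(2.0), Var(3.0), Var(1.0)
loss = w * x + b
loss.backward()
print(w.grad)  # 3.0
```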


class: middle

Backpropagation

  • Since a neural network is a composition of differentiable functions, the total derivatives of the loss can be evaluated backward, by applying the chain rule recursively over its computational graph.
  • The implementation of this procedure is called reverse automatic differentiation or backpropagation.

class: middle

Let us consider a simplified 1-hidden layer MLP and the following loss function: $$\begin{aligned} f(\mathbf{x}; \mathbf{W}_1, \mathbf{W}_2) &= \sigma\left( \mathbf{W}_2^T \sigma\left( \mathbf{W}_1^T \mathbf{x} \right)\right) \\ \mathcal{\ell}(y, \hat{y}; \mathbf{W}_1, \mathbf{W}_2) &= \text{cross\_ent}(y, \hat{y}) + \lambda \left( ||\mathbf{W}_1||_2 + ||\mathbf{W}_2||_2 \right) \end{aligned}$$ for $\mathbf{x} \in \mathbb{R}^p$, $y \in \mathbb{R}$, $\mathbf{W}_1 \in \mathbb{R}^{p \times q}$ and $\mathbf{W}_2 \in \mathbb{R}^q$.


class: middle

In the forward pass, intermediate values are all computed from inputs to outputs, which results in the annotated computational graph below:

.width-100[]


class: middle

The partial derivatives can be computed through a backward pass, by walking through all paths from outputs to parameters in the computational graph and accumulating the terms. For example, for $\frac{\partial \ell}{\partial \mathbf{W}_1}$ we have: $$\begin{aligned} \frac{\partial \ell}{\partial \mathbf{W}_1} &= \frac{\partial \ell}{\partial u_8}\frac{\partial u_8}{\partial \mathbf{W}_1} + \frac{\partial \ell}{\partial u_4}\frac{\partial u_4}{\partial \mathbf{W}_1} \\ \frac{\partial u_8}{\partial \mathbf{W}_1} &= ... \end{aligned}$$

.width-100[]


class: middle

.width-100[]

Let us zoom in on the computation of the network output $\hat{y}$ and of its derivative with respect to $\mathbf{W}_1$.

  • Forward pass: values $u_1$, $u_2$, $u_3$ and $\hat{y}$ are computed by traversing the graph from inputs to outputs given $\mathbf{x}$, $\mathbf{W}_1$ and $\mathbf{W}_2$.
  • Backward pass: by the chain rule we have $$\begin{aligned} \frac{\partial \hat{y}}{\partial \mathbf{W}_1} &= \frac{\partial \hat{y}}{\partial u_3} \frac{\partial u_3}{\partial u_2} \frac{\partial u_2}{\partial u_1} \frac{\partial u_1}{\partial \mathbf{W}_1} \\ &= \frac{\partial \sigma(u_3)}{\partial u_3} \frac{\partial \mathbf{W}_2^T u_2}{\partial u_2} \frac{\partial \sigma(u_1)}{\partial u_1} \frac{\partial \mathbf{W}_1^T \mathbf{x}}{\partial \mathbf{W}_1} \end{aligned}$$ Note how evaluating the partial derivatives requires the intermediate values computed forward.
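As a sketch, here are this zoomed-in forward pass and the chain-rule computation of $\frac{\partial \hat{y}}{\partial \mathbf{W}_1}$ for the simplified 1-hidden-layer network above (regularization term omitted; function names are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, W2):
    """Forward pass, storing intermediates needed by the backward pass."""
    u1 = W1.T @ x        # (q,)
    u2 = sigmoid(u1)     # (q,)
    u3 = W2 @ u2         # scalar, W2 in R^q
    return u1, u2, u3, sigmoid(u3)

def dy_dW1(x, W1, W2):
    """Chain rule from the slide, accumulated backward."""
    u1, u2, u3, y = forward(x, W1, W2)
    dy_du3 = y * (1 - y)               # sigma'(u3)
    du3_du2 = W2                       # (q,)
    du2_du1 = u2 * (1 - u2)            # elementwise sigma'(u1)
    delta = dy_du3 * du3_du2 * du2_du1 # (q,)
    return np.outer(x, delta)          # du1/dW1 contributes x; shape (p, q)
```

Note how `forward` must run first: the backward pass reuses the intermediate values $u_2$ and $\hat{y}$, exactly as remarked above.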

Vanishing gradients

For a long time (pre-2011), training deep MLPs with many layers was very difficult due to the vanishing gradient problem.

  • Small gradients slow down, and eventually block, stochastic gradient descent.
  • This results in a limited capacity of learning.

.width-100[] .caption[Backpropagated gradients normalized histograms (Glorot and Bengio, 2010).
Gradients for layers far from the output vanish to zero. ]


class: middle

Let us consider a simplified 2-hidden layer MLP, with $x, w_1, w_2, w_3 \in\mathbb{R}$, such that $$f(x; w_1, w_2, w_3) = \sigma\left(w_3\sigma\left( w_2 \sigma\left( w_1 x \right)\right)\right). $$

Under the hood, this would be evaluated as $$\begin{aligned} u_1 &= w_1 x \\ u_2 &= \sigma(u_1) \\ u_3 &= w_2 u_2 \\ u_4 &= \sigma(u_3) \\ u_5 &= w_3 u_4 \\ \hat{y} &= \sigma(u_5) \end{aligned}$$ and its derivative $\frac{\partial\hat{y}}{\partial w_1}$ as $$\begin{aligned}\frac{\partial\hat{y}}{\partial w_1} &= \frac{\partial \hat{y}}{\partial u_5} \frac{\partial u_5}{\partial u_4} \frac{\partial u_4}{\partial u_3} \frac{\partial u_3}{\partial u_2}\frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial w_1}\\ &= \frac{\partial \sigma(u_5)}{\partial u_5} w_3 \frac{\partial \sigma(u_3)}{\partial u_3} w_2 \frac{\partial \sigma(u_1)}{\partial u_1} x \end{aligned}$$
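This shrinkage can be observed numerically. A small sketch computing $\frac{\partial\hat{y}}{\partial w_1}$ for the scalar chain above, with depth and weight scale as illustrative choices of ours:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_w1(x, ws):
    """d y_hat / d w_1 for y_hat = sigma(w_L * sigma(... sigma(w_1 * x)))."""
    a, zs = x, []
    for w in ws:              # forward pass, storing the pre-activations
        z = w * a
        zs.append(z)
        a = sigmoid(z)
    g = x
    for z in zs:              # each sigma'(z) factor is at most 1/4
        s = sigmoid(z)
        g *= s * (1.0 - s)
    for w in ws[1:]:          # the intermediate weights w_2, ..., w_L
        g *= w
    return g

rng = np.random.default_rng(0)
for depth in (2, 5, 10, 20):
    print(depth, grad_w1(1.0, rng.normal(scale=0.5, size=depth)))
```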


class: middle

The derivative of the sigmoid activation function $\sigma$ is:

.center[]

$$\frac{\partial \sigma}{\partial x}(x) = \sigma(x)(1-\sigma(x))$$

Notice that $0 \leq \frac{\partial \sigma}{\partial x}(x) \leq \frac{1}{4}$ for all $x$.


class: middle

Assume that weights $w_1, w_2, w_3$ are initialized randomly from a Gaussian with zero-mean and small variance, such that with high probability $-1 \leq w_i \leq 1$.

Then,

$$\frac{\partial \hat{y}}{\partial w_1} = \underbrace{\frac{\partial \sigma(u_5)}{\partial u_5}}_{\leq \frac{1}{4}} \underbrace{w_3}_{\leq 1} \underbrace{\frac{\partial \sigma(u_3)}{\partial u_3}}_{\leq \frac{1}{4}} \underbrace{w_2}_{\leq 1} \underbrace{\frac{\partial \sigma(u_1)}{\partial u_1}}_{\leq \frac{1}{4}} x$$

This implies that the derivative $\frac{\partial \hat{y}}{\partial w_1}$ exponentially shrinks to zero as the number of layers in the network increases.

Hence the vanishing gradient problem.

  • In general, bounded activation functions (sigmoid, tanh, etc) are prone to the vanishing gradient problem.
  • Note the importance of a proper initialization scheme.

Rectified linear units

Instead of the sigmoid activation function, modern neural networks are for the most part based on rectified linear units (ReLU) (Glorot et al, 2011):

$$\text{ReLU}(x) = \max(0, x)$$

.center[]


class: middle

Note that the derivative of the ReLU function is

$$\frac{\partial }{\partial x} \text{ReLU}(x) = \begin{cases} 0 &\text{if } x \leq 0 \\ 1 &\text{otherwise} \end{cases}$$ .center[]

For $x=0$, the derivative is undefined. In practice, it is set to zero.


class: middle

Therefore, as long as the hidden units are active (i.e., have positive pre-activations),

$$\frac{\partial \hat{y}}{\partial w_1} = \underbrace{\frac{\partial \sigma(u_5)}{\partial u_5}}_{= 1} w_3 \underbrace{\frac{\partial \sigma(u_3)}{\partial u_3}}_{= 1} w_2 \underbrace{\frac{\partial \sigma(u_1)}{\partial u_1}}_{= 1} x$$

This solves the vanishing gradient problem, even for deep networks! (provided proper initialization)

Note that:

  • The ReLU unit dies when its input is negative, which might block gradient descent.
  • This is actually a useful property to induce sparsity.
  • This issue can also be solved using leaky ReLUs, defined as $$\text{LeakyReLU}(x) = \max(\alpha x, x)$$ for a small $\alpha \in \mathbb{R}^+$ (e.g., $\alpha=0.1$).
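A short sketch of both activations and their derivatives, with the convention from the slide (derivative set to zero at $x=0$):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative is undefined at x = 0; set to 0, as done in practice.
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.1):
    return np.maximum(alpha * x, x)

def leaky_relu_grad(x, alpha=0.1):
    # Never exactly zero, so the unit cannot "die".
    return np.where(x > 0, 1.0, alpha)
```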

Activation functions


.center[


]


class: middle, center

(demo)

???

Don't forget the magic trick!


Universal approximation

Let us consider the 1-hidden layer MLP $$f(x) = \sum_i w_i \text{ReLU}(x + b_i).$$
This model can approximate any smooth 1D function, provided enough hidden units.
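A minimal sketch of this claim, under assumptions of our choosing: the biases $b_i$ are fixed on a grid and only the output weights $w_i$ are fit, here by least squares rather than gradient descent, to the target $\sin(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
target = np.sin(x)                               # smooth 1D target to approximate

q = 50                                           # number of hidden units
b = np.linspace(-3, 3, q)                        # fixed biases spread over the range
Phi = np.maximum(0.0, x[:, None] + b[None, :])   # ReLU features, shape (200, q)
w, *_ = np.linalg.lstsq(Phi, target, rcond=None) # fit the output weights w_i
approx = Phi @ w

print(np.max(np.abs(approx - target)))           # shrinks as q grows
```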


class: middle

.center[]



class: middle

.bold[Universal approximation theorem.] (Cybenko 1989; Hornik et al, 1991) Let $\sigma(\cdot)$ be a bounded, non-constant continuous function. Let $I_p$ denote the $p$-dimensional hypercube, and $C(I_p)$ denote the space of continuous functions on $I_p$. Given any $f \in C(I_p)$ and $\epsilon > 0$, there exists $q > 0$ and $v_i, w_i, b_i, i=1, ..., q$ such that $$F(x) = \sum_{i \leq q} v_i \sigma(w_i^T x + b_i)$$ satisfies $$\sup_{x \in I_p} |f(x) - F(x)| < \epsilon.$$

  • It guarantees that even a single hidden-layer network can represent any classification problem in which the boundary is locally linear (smooth).
  • It does not inform about good/bad architectures, nor how they relate to the optimization procedure.
  • The universal approximation theorem generalizes to any non-polynomial (possibly unbounded) activation function, including the ReLU (Leshno et al, 1993).

LEGO® Deep Learning





.center.width-50[]


class: middle

.center.circle.width-30[]

.italic[ People are now building a new kind of software by .bold[assembling networks of parameterized functional blocks] and by .bold[training them from examples using some form of gradient-based optimization]. ]

.pull-right[Yann LeCun, 2018.]


class: middle

DL as an architectural language

.width-100[]


class: middle

.center[

The toolbox ]

.footnote[Credits: Oriol Vinyals, 2020.]


class: black-slide

LEGO® Creator Expert

.center[ .width-60[] .width-60[]

AlphaStar, DeepMind 2019. ]

.footnote[Credits: Vinyals et al, 2019.]


class: black-slide, middle

.center[

<iframe width="600" height="450" src="https://www.youtube.com/embed/j0z4FweCy4M?start=4279" frameborder="0" allowfullscreen></iframe>

HydraNet, Tesla 2021. ]


class: end-slide, center count: false

The end.


count: false

References

  • Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386.
  • Bottou, L., & Bousquet, O. (2008). The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (pp. 161-168).
  • Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.