### Convolutional Neural Networks

pedagogic talk based on Goodfellow, Bengio & Courville, Deep Learning (2016, Ch. 9)
and deeplearning.net/tutorial/lenet.html

Deep Learning Seminar · ICB · Helmholtz Munich · 9 May 2016

F. Alexander Wolf | falexwolf.de

Institute of Computational Biology

Helmholtz Munich

fullscreen: 'f' / navigation: arrow keys / black screen: 'b' / overview: 'o'

### Motivation

Hubel and Wiesel (1959, 1962, 1968): Nobel prize 1981 for work on mammalian vision system.   ▷ Results on primary visual cortex (V1).

• V1 is arranged in a spatial map mirroring the structure of the image in the retina.
• V1 has simple cells whose activity is a linear function of the image in a small localized receptive field.
• V1 has complex cells whose activity is invariant to small spatial translations.
• Neurons in V1 respond most strongly to very specific, simple patterns of light, such as oriented bars, but hardly respond to any other patterns.

### Recap

• A multilayer perceptron = feedforward neural network is a probabilistic model: layered matrix multiplications stacked with non-linear activation functions.
• Optimize the log likelihood (classification or regression error) by stochastic gradient descent; the full gradient is obtained by backpropagating layer gradients (see the sketch below).
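
For concreteness, a minimal NumPy sketch of this recap (dimensions, activation, and loss are illustrative assumptions, not taken from the talk): a two-layer perceptron, its forward pass, and backpropagated gradient steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: D-dimensional inputs, scalar regression targets (all names illustrative)
D, H, N = 4, 8, 32                       # input dim, hidden dim, batch size
X = rng.normal(size=(N, D))
y = rng.normal(size=(N, 1))

# two layers of matrix multiplications stacked with a non-linear activation
W1, b1 = 0.1 * rng.normal(size=(D, H)), np.zeros(H)
W2, b2 = 0.1 * rng.normal(size=(H, 1)), np.zeros(1)

for step in range(100):                  # (stochastic) gradient descent, full batch here
    # forward pass
    h = np.tanh(X @ W1 + b1)             # hidden layer
    y_hat = h @ W2 + b2                  # output layer
    loss = np.mean((y_hat - y) ** 2)     # squared error ~ Gaussian negative log likelihood

    # backpropagation: apply the chain rule layer by layer
    d_out = 2 * (y_hat - y) / N
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * (1 - h ** 2)  # tanh'(a) = 1 - tanh(a)^2
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0)

    # gradient step on all parameters
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= 0.1 * g
```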

#### So, what is a Convolutional Neural Network?

• It's simply a neural network that uses convolution in place of a "general matrix multiplication" in at least one of its layers.
LeCun, Bottou, Bengio & Haffner, Proc. IEEE 86 2278 (1998)

### What is a convolution?

• convolution of functions $f(t)$ and $w(t)$ $$(f * w)(t) = \int_{-\infty}^\infty d\tau\, w(t-\tau)\, f(\tau)$$
• similar to cross correlation of $f(t)$ and $w(t)$ $$(f \star w)(t) = \int_{-\infty}^\infty d\tau\, w(t+\tau)\, f(\tau)$$

▷  In deep learning, both of these operations are used in their discrete form and are referred to as "convolution":

$$\quad\,\, (f * w)_t = \sum_\tau w_{t-\tau}\, f_\tau$$
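
A minimal NumPy sketch of the discrete case (signal and kernel values are illustrative): the cross-correlation variant simply slides the kernel without flipping it.

```python
import numpy as np

f = np.array([0., 1., 2., 3., 4.])   # signal f_tau
w = np.array([1., -1.])              # kernel w

# discrete convolution: (f * w)_t = sum_tau w_{t - tau} f_tau  (kernel flipped)
conv = np.convolve(f, w, mode="valid")

# cross-correlation: kernel slides without flipping
xcorr = np.correlate(f, w, mode="valid")

print(conv)    # [ 1.  1.  1.  1.]  differences with one sign convention
print(xcorr)   # [-1. -1. -1. -1.]  ... with the other; only the kernel orientation differs
```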

### Convolution as constrained matrix multiplication

• discrete convolution of functions $f_t$ and $w_t$, $t\in\{1,2,...,D\}$,
$$\tilde f_t = \sum_\tau w_{t-\tau}\, f_\tau, \quad \text{i.e.} \quad \mathbf{\tilde f} = \mathbf{W} \mathbf{f}, \quad \mathbf{\tilde f},\mathbf{f} \in \mathbb{R}^D,$$ where $W_{t\tau} = w_{t-\tau}$, $\mathbf{W} \in \mathbb{R}^{D\times D}$.

▷  Instead of $D^2$, only $D$ independent components.

#### Natural extension: sparsity

• demand: $w_{t-\tau} \stackrel{!}{=} 0$ for $|t-\tau| > d$
[usual property of kernels: e.g. Gaussian $W_{t\tau} = e^{-\frac{(t-\tau)^2}{2d^2}}$]

▷  Instead of $D^2$, only $2d$ nonzero components.  ▷  Statistics ☺!
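
A sketch of the same operation written out as a matrix multiplication (shapes and kernel assumed for illustration): $\mathbf{W}$ is a banded Toeplitz matrix built from a single kernel, so only a handful of its $D^2$ entries are nonzero, and all of them are copies of the few kernel values.

```python
import numpy as np

D = 8
w = np.array([1.0, -1.0])                    # kernel, nonzero only for t - tau in {0, 1}
f = np.arange(D, dtype=float)

# build the banded Toeplitz convolution matrix W_{t,tau} = w_{t - tau}
W = np.zeros((D, D))
for t in range(D):
    for tau in range(D):
        if 0 <= t - tau < len(w):
            W[t, tau] = w[t - tau]

f_tilde = W @ f                              # matrix-multiplication view of the convolution
print(f_tilde)
print(np.count_nonzero(W), "of", D * D, "entries nonzero, all copies of", len(w), "kernel values")
```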

### Convolution as sparsity constraint: graphical

[Figure: connectivity between input $\mathbf{f}$ and output $\mathbf{\tilde f}$ for two weight matrices]

• general weight matrix $\mathbf{W}$ (arrows represent arbitrary values): receptive field of $\tilde f_t$ is the full range $D$
• convolution-type $\mathbf{W}$ (arrows: same values across receptive fields): receptive field of $\tilde f_t$ is a local range $2d$

### When is convolution a useful sparsity constraint?

Consider an example ($d=1$)

$\mathbf{W} = \left(\begin{array}{ccccc} \ddots & -1 & 1 & 0 & \ddots\\ \ddots & 0 & -1 & 1 & \ddots \end{array} \right)$ $\,\Leftrightarrow\,$ $\tilde f_t = f_t - f_{t-1}$,

that is, $\,\,\mathbf{f} \mapsto \mathbf{\tilde f}$:  [Figure: example signal $\mathbf{f}$ and its filtered version $\mathbf{\tilde f}$]

▷  Simple edge structures are revealed!
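
A small NumPy sketch of this example with an assumed step-shaped signal: the difference kernel responds only where the signal jumps.

```python
import numpy as np

# step-shaped toy signal (assumed for illustration): two flat plateaus
f = np.array([0., 0., 0., 1., 1., 1., 0., 0.])

# f~_t = f_t - f_{t-1}: applying the kernel (-1, 1) at every location
f_tilde = f[1:] - f[:-1]

print(f_tilde)   # [ 0.  0.  1.  0.  0. -1.  0.]: nonzero only at the two edges
```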

### When is convolution a useful sparsity constraint?

• To obtain $\mathbf{\tilde f}$, the same local linear operation is applied to every $t$ and its $d$ neighbors. Here, it's multiplying with $\mathbf{w} = (-1, 1)$.
• This is meaningful if the data has dependencies between degrees of freedom $f_t$ that appear independently of the index $t$ and are constrained locally to a distance $d$: the data features local patterns.

#### Examples for such local patterns

• Images: edges
• Audio: frequency patterns
• Language: grammar structures

Test: is the information content not invariant under permutations of the index $t$? If so, the data features local patterns.

### Learn a kernel

Let us initialize a kernel ($d=1$) with random values $w_i$

$\mathbf{W} = \left(\begin{array}{ccccc} \ddots & w_1 & w_2 & 0 & \ddots\\ \ddots & 0 & w_1 & w_2 & \ddots \end{array} \right)$ $\,\Leftrightarrow\,$ $\tilde f_t = w_1 f_t + w_2 f_{t-1}$,

again, $\,\,\mathbf{f} \mapsto \mathbf{\tilde f}$:  [Figure: the same example signal $\mathbf{f}$ and its output $\mathbf{\tilde f}$ under the random kernel]

• The random kernel, too, seems to detect edges quite well: most of the work is already done! ▷  It seems not too hard to learn meaningful kernels!
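
To make "learning a kernel" concrete, here is a toy sketch (the training signal, target, and loss are my own illustrative choices, not from the slides): starting from random $(w_1, w_2)$, plain gradient descent on a squared error against the output of the known edge filter recovers the difference kernel.

```python
import numpy as np

rng = np.random.default_rng(1)

f = rng.normal(size=200)                     # toy signal (illustrative)
target = f[1:] - f[:-1]                      # what the edge filter f_t - f_{t-1} would produce

w = rng.normal(size=2)                       # random kernel (w_1, w_2)
lr = 0.1
for step in range(1000):
    f_tilde = w[0] * f[1:] + w[1] * f[:-1]   # f~_t = w_1 f_t + w_2 f_{t-1}, as on the slide
    err = f_tilde - target
    grad = np.array([err @ f[1:], err @ f[:-1]]) / len(err)   # gradient of 0.5 * mean(err^2)
    w -= lr * grad

print(w)   # approximately (1, -1): the edge-detecting difference filter has been learned
```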

### Learn several kernels per layer

• Several kernels should learn different local patterns: e.g. edges oriented in different directions.
• So it is evidently meaningful to use the same kernel at every location $t$ of the input, because patterns appear in the same way across locations $t$ (see the sketch below).
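
A sketch of a bank of $K$ kernels producing $K$ feature maps (shapes assumed for illustration); each kernel is shared across all locations $t$ of the input.

```python
import numpy as np

rng = np.random.default_rng(2)

f = rng.normal(size=100)                   # input signal
kernels = rng.normal(size=(4, 3))          # K = 4 kernels of width 3, shared across all t

# one feature map per kernel: the same local operation applied at every location t
feature_maps = np.stack([np.correlate(f, w, mode="valid") for w in kernels])
print(feature_maps.shape)                  # (4, 98): K feature maps
```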

But: What about the relevance of where patterns appear?

### Pooling layers and local translational invariance

#### Assumption

• In most cases, classification information does not depend strongly on the location (index $t$) of a pattern. That is, the presence of a pattern is more important than its location.
• In many cases, our only interest is the presence or absence of a pattern.

#### Example: shifting the input

• Pooling layer: implements local translational invariance
• here: a max-pooling layer
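
A minimal sketch of max pooling (pool width and signals are assumed for illustration): shifting the input by one position leaves the pooled output unchanged, which is exactly the local translational invariance we want.

```python
import numpy as np

def max_pool(x, width=2):
    # non-overlapping max pooling with downsampling by `width`
    return x[: len(x) // width * width].reshape(-1, width).max(axis=1)

detector = np.array([0., 0., 1., 0., 0., 0., 1., 0.])   # detector-stage output
shifted = np.roll(detector, 1)                          # same pattern, shifted by one position

print(max_pool(detector))   # [0. 1. 0. 1.]
print(max_pool(shifted))    # [0. 1. 0. 1.]: unchanged under the small shift
```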

### Assemble everything

1. Read input $\mathbf{f}$.
2. Convolution stage
$\,\mathbf{\tilde f}^{(k)} := \mathbf{W}^{(k)} \mathbf{f},$
where $\mathbf{W}^{(k)}$ is one of $K$ convolution kernels, $k=1,...,K$.
3. Detector stage $\, \tilde f_t^{(k)} := \phi(\tilde f_t^{(k)} + b)$ where $\phi$ is an activation function, $b$ a bias.
4. Pooling stage $\, \tilde f_t^{(k)} := \max_{\tau \in [t-d,t+d]} \tilde f_\tau^{(k)}$ (see the sketch below).
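
A sketch of one full convolutional layer assembled from these stages in NumPy (kernel bank, ReLU activation, and pool width are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(3)

def conv_layer(f, kernels, bias, pool=2):
    """One convolutional layer: convolution -> detector -> max pooling (+ downsampling)."""
    maps = []
    for w, b in zip(kernels, bias):
        h = np.correlate(f, w, mode="valid")                          # 2. convolution stage
        h = np.maximum(h + b, 0.0)                                    # 3. detector stage (ReLU)
        h = h[: len(h) // pool * pool].reshape(-1, pool).max(axis=1)  # 4. pooling stage
        maps.append(h)
    return np.stack(maps)                                             # K downsampled feature maps

f = rng.normal(size=64)                    # 1. read input f
kernels = rng.normal(size=(8, 5))          # K = 8 kernels of width 5
bias = np.zeros(8)

out = conv_layer(f, kernels, bias)
print(out.shape)                           # (8, 30): reduced resolution, K feature maps
```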

• Receptive field can grow layer-wise.

• Downsampling after the pooling layer accounts for the reduced information content.

• From a Bayesian viewpoint, convolutional networks encode our beliefs about the structure of certain data, as argued up to here, using an infinitely strong prior.