Convolutional Neural Networks
pedagogic talk based on
Goodfellow, Bengio & Courville, Deep Learning (2016, Ch. 9)
and deeplearning.net/tutorial/lenet.html
Deep Learning Seminar · ICB · Helmholtz Munich · 9 May 2016
F. Alexander Wolf  falexwolf.de
Institute of Computational Biology
Helmholtz Munich
Motivation
Hubel and Wiesel (1959, 1962, 1968):
Nobel prize 1981 for work on mammalian vision system.
▷ Results on primary visual cortex (V1).

V1 is arranged in a spatial map mirroring the structure of the image
in the retina.

V1 has simple cells whose activity is a linear function of
the image in a small localized receptive field.

V1 has complex cells whose activity is invariant to small
spatial translations.

Neurons in V1 respond most strongly to very
specific, simple patterns of light, such as oriented bars, and
hardly respond to any other patterns.
Recap
 A Multilayer perceptron = Feedforward Neural
Network is a probabilistic model: layered matrix multiplications stacked
with nonlinear activation functions.
 Optimize the log likelihood (classification or
regression error) by stochastic gradient descent.
The full gradient is obtained by backpropagating layer gradients.
So, what is a Convolutional Neural Network?
 It's simply a neural network that
uses convolution
in place of a "general matrix multiplication" in
at least one of its layers.
LeCun, Bottou, Bengio & Haffner, Proc. IEEE 86 2278 (1998)
What is a convolution?
 convolution of functions $f(t)$ and $w(t)$
$$ (f * w)(t) = \int_{-\infty}^\infty d\tau\, w(t-\tau)\, f(\tau) $$
 similar to cross correlation of $f(t)$ and $w(t)$
$$ (f \star w)(t) = \int_{-\infty}^\infty d\tau\, w(t+\tau)\, f(\tau)$$
▷
In deep learning, both of these operations are used in their discrete form
and referred to as "convolution".
$$ \quad\,\, (f * w)_t = \sum_\tau w_{t-\tau}\, f_\tau$$
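To make the discrete formula concrete, here is a minimal NumPy sketch (not part of the original tutorial code; signal and kernel values are arbitrary illustrative choices) that evaluates the sum $(f*w)_t = \sum_\tau w_{t-\tau} f_\tau$ directly and checks it against NumPy's built-in convolution.

import numpy as np

f = np.array([0., 0., 1., 1., 1., 0., 0.])   # illustrative signal
w = np.array([0.5, 0.5])                      # illustrative kernel

# evaluate (f * w)_t = sum_tau w_{t - tau} f_tau directly;
# indices outside the kernel's support contribute zero
conv = np.zeros(len(f) + len(w) - 1)
for t in range(len(conv)):
    for tau in range(len(f)):
        if 0 <= t - tau < len(w):
            conv[t] += w[t - tau] * f[tau]

assert np.allclose(conv, np.convolve(f, w))   # NumPy's 'full' convolution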
Convolution as constrained matrix multiplication

discrete convolution of functions $f_t$ and
$w_t$, $t\in\{1,2,...,D\}$,
$$ \tilde f_t = \sum_\tau w_{t-\tau}\, f_\tau, \quad \mathbf{\tilde f}
= \mathbf{W} \mathbf{f}, \quad \mathbf{\tilde f},\mathbf{f} \in \mathbb{R}^D $$
where $W_{t\tau} = w_{t-\tau}$, $\mathbf{W} \in
\mathbb{R}^{D\times D}$.
▷
Instead of $D^2$, only $D$ independent components.
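As a sketch of this equivalence (assuming a kernel with only two nonzero values $w_0$, $w_1$, chosen arbitrarily here), one can build the $D \times D$ matrix $W_{t\tau} = w_{t-\tau}$ explicitly and verify that $\mathbf{W}\mathbf{f}$ reproduces the convolution:

import numpy as np

D = 8
f = np.random.rand(D)
w = {0: 1.0, 1: -1.0}            # kernel values w_0, w_1; all other w_{t-tau} are 0

# W depends on t and tau only through their difference
W = np.zeros((D, D))
for t in range(D):
    for tau in range(D):
        W[t, tau] = w.get(t - tau, 0.0)

f_tilde = W @ f                   # the convolution as a matrix multiplication
assert np.allclose(f_tilde, np.convolve(f, [1.0, -1.0], mode='full')[:D])
# W has D^2 entries but is fully specified by the kernel values alone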
Natural extension: sparsity

demand: $w_{t-\tau} \stackrel{!}{=} 0$ for $|t-\tau| > d$
[usual property of kernels: e.g. Gaussian $ W_{t\tau} =
e^{-\frac{(t-\tau)^2}{2d^2}}$]
▷
Instead of $D^2$, only $2d$ nonzero components.
▷ Statistics ☺!
Convolution as sparsity constraint: graphical
[Figure: weight matrices drawn as arrows from the input $\mathbf{f}$ to the output $\mathbf{\tilde f}$.]
General weight matrix $\mathbf{W}$ (arrows represent arbitrary values):
receptive field of $\tilde f_t$ is the full range $D$.
Convolution-type $\mathbf{W}$ (arrows carry the same values across receptive fields):
receptive field of $\tilde f_t$ is the local range $2d$.
When is convolution a useful sparsity constraint?
Consider an example ($d=1$)
$ \mathbf{W} = \left(\begin{array}{ccccc}
\ddots & 1 & -1 & 0 & \ddots\\
\ddots & 0 & 1 & -1 & \ddots \end{array} \right)$
$\,\Leftrightarrow\,$ $\tilde f_t = f_t - f_{t-1}$,
that is, $\,\,\mathbf{f}$ = [input image] $\,\mapsto\,\, \mathbf{\tilde f}$ = [filtered image, edges highlighted]
▷
Simple edge structures are revealed!
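A minimal sketch of this $d = 1$ example (the step signal below is an illustrative stand-in for the image shown on the slide): the kernel computes $f_t - f_{t-1}$, which is nonzero exactly at the jumps of the signal.

import numpy as np

f = np.array([0., 0., 0., 1., 1., 1., 0., 0.])     # a signal with two "edges"
f_shifted = np.concatenate(([0.], f[:-1]))          # f_{t-1}, left boundary set to 0
f_tilde = f - f_shifted                             # f_t - f_{t-1}
print(f_tilde)
# -> [ 0.  0.  0.  1.  0.  0. -1.  0.]  -- nonzero only where the signal jumps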
When is convolution a useful sparsity constraint?

To obtain $\mathbf{\tilde f}$, the same local linear operation
is applied to every $t$ and its $d$ neighbors. Here, it's multiplying
with $\mathbf{w} = (1, -1)$.

This is meaningful if the data has dependencies
between degrees of freedom $f_t$ that
appear
independent of the index $t$
and are constrained locally to a distance $d$:
the data features local patterns.
Examples for such local patterns
 Images: edges
 Audio: frequency patterns
 Language: grammar structures
Test: is the information content not invariant under permutations of the index?
Learn a kernel
Let us initialize a kernel ($d=1$) with random values $w_i$
$ \mathbf{W} = \left(\begin{array}{ccccc}
\ddots & w_1 & w_2 & 0 & \ddots\\
\ddots & 0 & w_1 & w_2 & \ddots \end{array} \right)$
$\,\Leftrightarrow\,$ $\tilde f_t = w_1 f_t + w_2 f_{t-1}$,
again, $\,\,\mathbf{f}$ = [input image] $\,\mapsto\,\, \mathbf{\tilde f}$ = [filtered image, edges highlighted]

Even the random kernel seems to detect edges very well!
Most of the work is already done! ▷
It seems not too hard to learn meaningful kernels!
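A minimal sketch of this claim, using the same illustrative step signal as above: a randomly initialized kernel $(w_1, w_2)$ already produces an output that is constant on flat regions and deviates exactly at the jumps, so edges remain visible before any learning.

import numpy as np

rng = np.random.default_rng(0)
w1, w2 = rng.standard_normal(2)                     # random kernel values

f = np.array([0., 0., 0., 1., 1., 1., 0., 0.])
f_shifted = np.concatenate(([0.], f[:-1]))          # f_{t-1}
f_tilde = w1 * f + w2 * f_shifted                   # w_1 f_t + w_2 f_{t-1}
print(np.round(f_tilde, 2))
# constant on the flat regions, different values at the two jumps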
Learn several kernels per layer
 Several kernels should learn different
local patterns:
e.g. edges oriented in different directions.

So: evidently it's meaningful to use the same kernel for each
location $t$ in the input, because patterns appear in the same
way across locations $t$.
But: What about the relevance of
where patterns appear?
Pooling layers and local translational invariance
Assumption

In most cases, classification information does not depend strongly
on the location (index $t$) of a pattern.
That is, the presence of a pattern is more
important than its location.
 In many cases, our only
interest is the presence or absence of a pattern.
Example: shifting the input
 Pooling layer = implement local translational invariance
 here: max pooling layer
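A minimal sketch of max pooling over a window $[t-d, t+d]$ with $d = 1$ (the impulse signals are illustrative): shifting the input pattern by one position changes the pooled output only slightly.

import numpy as np

def max_pool(x, d=1):
    # max over the window [t - d, t + d]; padding with -inf handles the borders
    padded = np.pad(x, d, constant_values=-np.inf)
    return np.array([padded[t:t + 2 * d + 1].max() for t in range(len(x))])

a = np.array([0., 0., 1., 0., 0., 0.])
b = np.array([0., 0., 0., 1., 0., 0.])   # the same pattern, shifted by one position
print(max_pool(a))   # [0. 1. 1. 1. 0. 0.]
print(max_pool(b))   # [0. 0. 1. 1. 1. 0.]
# the pooled outputs agree at most positions although the raw inputs differ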
Assemble everything

Read input $\mathbf{f}$.

Convolution stage
$\,\mathbf{\tilde f}^{(k)} := \mathbf{W}^{(k)} \mathbf{f},$
where $\mathbf{W}^{(k)}$ is one of $K$ convolution kernels,
$k=1,...,K$.

Detector stage
$\, \tilde f_t^{(k)} := \phi(\tilde f_t^{(k)} + b)$
where $\phi$ is an activation function, $b$ a bias.

Pooling stage
$\, \tilde f_t^{(k)} := \max_{\tau \in [t-d,\,t+d]} \tilde f_\tau^{(k)}$
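The three stages can be assembled into one function; the sketch below (NumPy, 1-d input, zero padding, ReLU as $\phi$, kernel values chosen arbitrarily) is an illustration of the pipeline above, not the LeNet code from the tutorial.

import numpy as np

def conv_layer(f, kernels, bias=0.0, d=1):
    """Convolution, detector and max-pooling stage for K kernels of width 2d + 1."""
    D, K = len(f), len(kernels)
    out = np.empty((K, D))
    f_pad = np.pad(f, d)                              # zero padding keeps the length D
    for k, w in enumerate(kernels):
        # convolution stage: f~_t = sum_tau w_{t - tau} f_tau (kernel centered on t)
        conv = np.array([w @ f_pad[t:t + 2 * d + 1][::-1] for t in range(D)])
        # detector stage: elementwise nonlinearity phi, here a ReLU
        det = np.maximum(conv + bias, 0.0)
        # pooling stage: max over the window [t - d, t + d]
        det_pad = np.pad(det, d, constant_values=-np.inf)
        out[k] = [det_pad[t:t + 2 * d + 1].max() for t in range(D)]
    return out

f = np.array([0., 0., 1., 1., 1., 0., 0., 0.])
kernels = np.array([[1., 0., -1.], [0.5, 1., 0.5]])   # K = 2 illustrative kernels
print(conv_layer(f, kernels))                          # output of shape (K, D)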
Some comments

The receptive field can grow from layer to layer.

Downsampling after the pooling layer accounts for the reduced information content.

From a Bayesian view, convolutional networks encode our beliefs
about the structure of certain data, as argued up to here,
using an infinitely strong prior.
More comments/questions

Why do we put fully connected layers on top of the
convolutional layers? For example, if our classification label
is translation-invariant, shouldn't there be a smarter way?

Traditionally, CNNs have been used for whole-image
classification. Recent work deals with their application to
pixel-wise classification (object detection, segmentation, tracking),
and aims at going beyond an independent treatment of patches.
Thank you!