Convolutional Neural Networks

pedagogic talk based on Goodfellow, Bengio & Courville, Deep Learning (2016, Ch. 9)
and deeplearning.net/tutorial/lenet.html

Deep Learning Seminar · ICB · Helmholtz Munich · 9 May 2016


F. Alexander Wolf | falexwolf.de

Institute of Computational Biology

Helmholtz Munich


Motivation

Hubel and Wiesel (1959, 1962, 1968): Nobel prize 1981 for their work on the mammalian visual system.   ▷ Results on primary visual cortex (V1).

  • V1 is arranged in a spatial map mirroring the structure of the image in the retina.
  • V1 has simple cells whose activity is a linear function of the image in a small localized receptive field.
  • V1 has complex cells whose activity is invariant to small spatial translations.
  • Neurons in V1 respond most strongly to very specific, simple patterns of light, such as oriented bars, and hardly respond to any other patterns.

Recap

  • A multilayer perceptron = feedforward neural network is a probabilistic model: layered matrix multiplications interleaved with non-linear activation functions.
  • Optimize the log likelihood (classification or regression error) by stochastic gradient descent. The full gradient is obtained by backpropagating layer-wise gradients.

So, what is a Convolutional Neural Network?

  • It is simply a neural network that uses convolution in place of general matrix multiplication in at least one of its layers.
    LeCun, Bottou, Bengio & Haffner, Proc. IEEE 86 2278 (1998)

What is a convolution?

  • convolution of functions $f(t)$ and $w(t)$ $$ (f * w)(t) = \int_{-\infty}^\infty d\tau\, w(t-\tau)\, f(\tau) $$
  • similar to cross correlation of $f(t)$ and $w(t)$ $$ (f \star w)(t) = \int_{-\infty}^\infty d\tau\, w(t+\tau)\, f(\tau)$$

▷  In deep learning, both of these operations are used in their discrete forms, and both are referred to as "convolution":

$$ \quad\,\, (f * w)_t = \sum_\tau w_{t-\tau}\, f_\tau$$
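As a quick numerical illustration (toy arrays of my own choosing, not from the talk): numpy's convolution and cross-correlation differ only by a flip of the kernel, so for a symmetric kernel they coincide.

```python
import numpy as np

f = np.array([1., 2., 4., 7., 11.])
w = np.array([0.25, 0.5, 0.25])

# Discrete convolution: (f * w)_t = sum_tau w_{t - tau} f_tau  (the kernel is flipped).
conv = np.convolve(f, w, mode='same')

# Cross-correlation: the kernel is not flipped.
corr = np.correlate(f, w, mode='same')

print(np.allclose(conv, corr))                                   # True here, since w is symmetric
print(np.allclose(corr, np.convolve(f, w[::-1], mode='same')))   # True in general: corr = conv with flipped kernel
```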

Convolution as constrained matrix multiplication

  • discrete convolution of functions $f_t$ and $w_t$, $t\in\{1,2,...,D\}$,
    $$ \tilde f_t = \sum_\tau w_{t-\tau}\, f_\tau \quad\Leftrightarrow\quad \mathbf{\tilde f} = \mathbf{W} \mathbf{f}, \quad \mathbf{\tilde f},\mathbf{f} \in \mathbb{R}^D, $$ where $W_{t\tau} = w_{t-\tau}$, $\mathbf{W} \in \mathbb{R}^{D\times D}$.

▷  Instead of $D^2$, only $D$ independent components.
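A small sketch (kernel and size chosen by me for illustration) that builds $\mathbf{W}$ explicitly and checks that the matrix-vector product reproduces numpy's discrete convolution:

```python
import numpy as np

D = 6
w = {0: 1.0, 1: -1.0}              # kernel values w_{t - tau}; zero for all other differences

# Build W_{t,tau} = w_{t - tau}: a (banded) Toeplitz matrix fully determined by the kernel.
W = np.zeros((D, D))
for t in range(D):
    for tau in range(D):
        W[t, tau] = w.get(t - tau, 0.0)

f = np.arange(D, dtype=float)
print(W @ f)                             # matrix-vector product ...
print(np.convolve(f, [1.0, -1.0])[:D])   # ... equals the discrete convolution
```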

Natural extension: sparsity

  • demand: $w_{t-\tau} \stackrel{!}{=} 0$ for $|t-\tau| > d$
    [usual property of kernels: e.g. a Gaussian $ W_{t\tau} = e^{-\frac{(t-\tau)^2}{2d^2}}$ effectively vanishes for $|t-\tau| \gg d$]

▷  Instead of $D^2$, only $2d{+}1$ independent nonzero components.  ▷  Much better statistics ☺!
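In terms of parameter counts, with toy numbers of my own choosing, the saving is drastic:

```python
D, d = 1024, 2                    # toy sizes: input length and kernel half-width
dense_params = D * D              # general weight matrix W
conv_params = 2 * d + 1           # one shared kernel of width 2d + 1
print(dense_params, conv_params)  # 1048576 vs. 5
```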

Convolution as sparsity constraint: graphical

[Figure: two weight-matrix diagrams, each mapping the input $\mathbf{f}$ to the output $\mathbf{\tilde f}$. One panel shows a general weight matrix $\mathbf{W}$ (arrows represent arbitrary values); the receptive field of $\tilde f_t$ is the full range $D$. The other shows a convolution-type $\mathbf{W}$ (arrows carry the same values across receptive fields); the receptive field of $\tilde f_t$ is a local range $2d$.]

When is convolution a useful sparsity constraint?

Consider an example ($d=1$)

$ \mathbf{W} = \left(\begin{array}{ccccc} \ddots & -1 & 1 & 0 & \ddots\\ \ddots & 0 & -1 & 1 & \ddots \end{array} \right)$ $\,\Leftrightarrow\,$ $\tilde f_t = f_t - f_{t-1}$,

that is, $\,\,\mathbf{f}$ = [input image] $\,\mapsto\,\, \mathbf{\tilde f}$ = [filtered image: edge structures highlighted]

▷  Simple edge structures are revealed!
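A minimal numpy sketch of this finite-difference convolution; the step signal below is a made-up stand-in for the image:

```python
import numpy as np

# A 1-D signal with two "edges" (jumps).
f = np.array([0., 0., 0., 1., 1., 1., 0., 0.])

# Kernel w = (-1, 1): with w_0 = 1, w_1 = -1, we get (f * w)_t = f_t - f_{t-1}.
w = np.array([1., -1.])

f_tilde = np.convolve(f, w, mode='full')[:len(f)]   # keep the first D outputs

print(f_tilde)                    # [ 0.  0.  0.  1.  0.  0. -1.  0.]  -> nonzero exactly at the edges
print(np.diff(f, prepend=0.))     # the same result via finite differences
```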

When is convolution a useful sparsity constraint?

  • To obtain $\mathbf{\tilde f}$, the same local linear operation is applied to every $t$ and its $d$ neighbors. Here, it is multiplication with $\mathbf{w} = (-1, 1)$.
  • This is meaningful if the data has dependencies between the degrees of freedom $f_t$ that appear independently of the index $t$ and are constrained locally to a distance $d$: the data features local patterns.

Examples for such local patterns

  • Images: edges
  • Audio: frequency patterns
  • Language: grammar structures

Quick test: is the information content destroyed when the indices $t$ are permuted? If so, the data features local patterns.

Learn a kernel

Let us initialize a kernel ($d=1$) with random values $w_i$

$ \mathbf{W} = \left(\begin{array}{ccccc} \ddots & w_1 & w_2 & 0 & \ddots\\ \ddots & 0 & w_1 & w_2 & \ddots \end{array} \right)$ $\,\Leftrightarrow\,$ $\tilde f_t = w_1 f_{t-1} + w_2 f_t$,

again, $\,\,\mathbf{f}$ = [same input image] $\,\mapsto\,\, \mathbf{\tilde f}$ = [filtered image: edges still clearly visible]

  • Even the random kernel seems to detect edges quite well, so most of the work is already done! ▷  It seems not too hard to learn meaningful kernels (toy sketch below).
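As a complementary toy illustration (entirely my own construction, not from the talk): starting from a random 2-tap kernel, plain gradient descent on a squared error quickly recovers the finite-difference edge detector on the step signal used above.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.array([0., 0., 0., 1., 1., 1., 0., 0.])   # toy step signal as above
target = np.diff(f, prepend=0.)                  # desired output: f_t - f_{t-1}

w = rng.normal(size=2)                           # random 2-tap kernel (w_0, w_1)
f_prev = np.concatenate(([0.], f[:-1]))          # f_{t-1}, zero-padded at the boundary

for _ in range(200):                             # gradient descent on 0.5 * sum(err**2)
    pred = w[0] * f + w[1] * f_prev              # output of the 2-tap convolution
    err = pred - target
    w -= 0.1 * np.array([(err * f).sum(), (err * f_prev).sum()])

print(w)                                         # close to [ 1. -1.]
```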

Learn several kernels per layer

  • Several kernels should learn different local patterns: e.g. edges oriented in different directions.
  • So it is indeed meaningful to use the same kernel at every location $t$ of the input, because patterns appear in the same way across locations $t$ (see the shape sketch below).
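A minimal sketch of the resulting shapes (all sizes made up): $K$ kernels applied to the same input yield $K$ feature maps.

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.normal(size=100)                  # toy 1-D input
K, width = 4, 3                           # K kernels of width 2d + 1 = 3
kernels = rng.normal(size=(K, width))

# One feature map per kernel: the same w^(k) is applied at every location t.
feature_maps = np.stack([np.convolve(f, w, mode='same') for w in kernels])
print(feature_maps.shape)                 # (4, 100): K maps, one value per location
```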

But: What about the relevance of where patterns appear?

Pooling layers and local translational invariance

Assumption

  • In most cases, classification information does not depend strongly on the location (index $t$) of a pattern. That is, the presence of a pattern is more important than its location.
  • In many cases, our only interest is the presence or absence of a pattern.

Example: shifting the input

  • Pooling layers implement local translational invariance.
  • here: a max pooling layer (see the sketch below)
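A minimal sketch of max pooling over a window of radius $d=1$, using made-up binary detector outputs:

```python
import numpy as np

def max_pool(x, d=1):
    """Max over the window [t - d, t + d] for every position t (edges clipped)."""
    return np.array([x[max(t - d, 0):t + d + 1].max() for t in range(len(x))])

# Detector output with two "detected patterns", and the same output shifted by one position.
x         = np.array([0., 0., 1., 0., 0., 0., 1., 0.])
x_shifted = np.roll(x, 1)

print(max_pool(x))          # [0. 1. 1. 1. 0. 1. 1. 1.]
print(max_pool(x_shifted))  # [0. 0. 1. 1. 1. 0. 1. 1.]
# The raw detector outputs share no active positions after the shift, but the pooled
# outputs still agree on most of them: presence is kept, exact location is discarded.
```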

Assemble everything

  1. Read input $\mathbf{f}$.
  2. Convolution stage
    $\,\mathbf{\tilde f}^{(k)} := \mathbf{W}^{(k)} \mathbf{f},$
    where $\mathbf{W}^{(k)}$ is one of $K$ convolution kernels, $k=1,...,K$.
  3. Detector stage $\, \tilde f_t^{(k)} := \phi(\tilde f_t^{(k)} + b)$ where $\phi$ is an activation function, $b$ a bias.
  4. Pooling stage $\, \tilde f_t^{(k)} := \max_{\tau \in [t-d,t+d]} \tilde f_\tau^{(k)}$ (see the sketch below)
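Putting the four stages together, a compact numpy sketch of one such layer (ReLU chosen as an example activation $\phi$; kernel count, widths, and sizes are made up):

```python
import numpy as np

def conv_layer(f, kernels, b, d=1):
    """One convolutional layer: convolution -> detector (ReLU) -> max pooling.

    f       : 1-D input signal of length D
    kernels : array of shape (K, kernel_width), one kernel per feature map
    b       : biases, one per kernel
    """
    out = []
    for k, w in enumerate(kernels):
        conv = np.convolve(f, w, mode='same')             # 2. convolution stage
        det = np.maximum(conv + b[k], 0.0)                # 3. detector stage (phi = ReLU)
        pool = np.array([det[max(t - d, 0):t + d + 1].max()
                         for t in range(len(det))])       # 4. pooling stage
        out.append(pool)
    return np.stack(out)                                  # shape (K, D)

rng = np.random.default_rng(2)
f = rng.normal(size=32)                                   # 1. read input
kernels = rng.normal(size=(3, 3))                         # K = 3 kernels of width 3
print(conv_layer(f, kernels, b=np.zeros(3)).shape)        # (3, 32)
```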

Some comments

  • The receptive field can grow layer-wise: units in deeper layers see ever larger parts of the input.

  • Downsampling after the pooling layer accounts for the reduced information content.

  • From a Bayesian view, convolutional networks encode our beliefs about the structure of certain data, as argued up to here, using an infinitely strong prior.

More comments/questions

  • Why do we put fully connected layers on top of the convolutional layers? For example, if our classification label is translation-invariant, shouldn't there be a smarter way?
  • Traditionally, CNNs have been used for whole-image classification. Recent work applies them to pixelwise classification (object detection, segmentation, tracking) and aims at going beyond an independent treatment of patches.

Thank you!