Convolutional Neural Networks
pedagogic talk based on
Goodfellow, Bengio & Courville, Deep Learning (2016, Ch. 9)
and deeplearning.net/tutorial/lenet.html
Deep Learning Seminar · ICB · Helmholtz Munich · 9 May 2016
F. Alexander Wolf | falexwolf.de
Institute of Computational Biology
Helmholtz Munich
Motivation
Hubel and Wiesel (1959, 1962, 1968):
Nobel Prize 1981 for work on the mammalian visual system.
▷ Results on primary visual cortex (V1).
-
V1 is arranged in a spatial map mirroring the structure of the image
in the retina.
-
V1 has simple cells whose activity is a linear function of
the image in a small localized receptive field.
-
V1 has complex cells whose activity is invariant to small
spatial translations.
-
Neurons in V1 respond most strongly to very
specific, simple patterns of light, such as oriented bars, but
hardly respond to other patterns.
Recap
- A multilayer perceptron = feedforward neural
network is a probabilistic model: layered matrix multiplications interleaved
with non-linear activation functions.
- Optimize log likelihood (classification or
regression error) by stochastic gradient descent.
Full gradient by backpropagating layer gradients.
So, what is a Convolutional Neural Network?
- It's simply a neural network that
uses convolution
in place of a "general matrix multiplication" in
at least one of its layers.
LeCun, Bottou, Bengio & Haffner, Proc. IEEE 86 2278 (1998)
What is a convolution?
- convolution of functions f(t) and w(t)
$(f \ast w)(t) = \int_{-\infty}^{\infty} \mathrm{d}\tau\, w(t-\tau)\, f(\tau)$
- similar to cross correlation of f(t) and w(t)
$(f \star w)(t) = \int_{-\infty}^{\infty} \mathrm{d}\tau\, w(t+\tau)\, f(\tau)$
▷
In deep learning, both of these operations are used in their discrete form
and referred to as "convolution":
$(f \ast w)_t = \sum_\tau w_{t-\tau}\, f_\tau$
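As a minimal numpy sketch of these discrete formulas (the toy signal and kernel are made up for illustration), note that numpy's convolve flips the kernel while correlate does not:

```python
import numpy as np

f = np.array([0., 0., 1., 1., 1., 0., 0.])   # toy signal f_t
w = np.array([1., -1.])                      # toy kernel w_t

# discrete convolution: (f*w)_t = sum_tau w_{t-tau} f_tau
conv = np.convolve(f, w, mode='full')

# discrete cross-correlation: same sum, but without flipping the kernel
corr = np.correlate(f, w, mode='full')

print(conv)
print(corr)
# convolving with w equals correlating with the flipped kernel
assert np.allclose(conv, np.correlate(f, w[::-1], mode='full'))
```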
Convolution as constrained matrix multiplication
-
discrete convolution of functions $f_t$ and
$w_t$, $t \in \{1, 2, \ldots, D\}$:
$\tilde f_t = \sum_\tau w_{t-\tau}\, f_\tau = (Wf)_t, \qquad \tilde f, f \in \mathbb{R}^D,$
where $W_{t\tau} = w_{t-\tau}$, $W \in \mathbb{R}^{D \times D}$.
▷
Instead of $D^2$, only $D$ independent components.
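A quick numpy check of this identification (sizes and kernel values are arbitrary): filling $W_{t\tau} = w_{t-\tau}$ and multiplying reproduces the discrete convolution.

```python
import numpy as np

D = 8
rng = np.random.default_rng(0)
f = rng.normal(size=D)              # signal f in R^D
w = np.array([0.5, -1.0, 0.5])      # small kernel w_0, w_1, w_2

# W_{t,tau} = w_{t-tau}; entries outside the kernel support are zero
W = np.zeros((D, D))
for t in range(D):
    for tau in range(D):
        if 0 <= t - tau < len(w):
            W[t, tau] = w[t - tau]

f_tilde = W @ f                     # convolution as a matrix multiplication
assert np.allclose(f_tilde, np.convolve(f, w, mode='full')[:D])
```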
Natural extension: sparsity
-
demand: $w_{t-\tau} \overset{!}{=} 0$ for $|t-\tau| > d$
[usual property of kernels: e.g. the Gaussian $W_{t\tau} = e^{-\frac{(t-\tau)^2}{2d^2}}$]
▷
Instead of $D^2$, only $2d$ nonzero components.
▷ Statistics ☺!
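For the Gaussian kernel mentioned in the bracket above, a rough sketch (the values of D, d, and the cutoff are arbitrary) of why such a $W$ is effectively banded:

```python
import numpy as np

D, d = 40, 2
t = np.arange(D)

# Gaussian weight matrix W_{t,tau} = exp(-(t - tau)^2 / (2 d^2))
W = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2 * d ** 2))

# entries decay quickly with |t - tau|: only a band of width ~ a few * d matters
row = W[D // 2]
print(np.count_nonzero(row > 1e-3))        # non-negligible entries in one row
print(np.count_nonzero(W > 1e-3) / D**2)   # fraction of the D^2 entries that matter
```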
Convolution as sparsity constraint: graphical
[Figure: general weight matrix $W$ (arrows represent arbitrary values); the receptive field of $\tilde f_t$ spans the full range $D$]
[Figure: convolution-type $W$ (arrows carry the same values across receptive fields); the receptive field of $\tilde f_t$ is a local range $2d$]
When is convolution a useful sparsity constraint?
Consider an example (d=1)
$W = \begin{pmatrix} \ddots & \ddots & & \\ & -1 & 1 & 0 & \\ & 0 & -1 & 1 & \\ & & \ddots & \ddots \end{pmatrix} \quad\Leftrightarrow\quad \tilde f_t = f_t - f_{t-1},$
[Figure: example input $f$ and its filtered output $\tilde f$]
▷
Simple edge structures are revealed!
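A tiny numpy version of this example (the step signal is made up): applying $\tilde f_t = f_t - f_{t-1}$ yields spikes exactly at the jumps.

```python
import numpy as np

f = np.array([0., 0., 0., 1., 1., 1., 1., 0., 0.])  # signal with two "edges"

# difference filter from the slide: f~_t = f_t - f_{t-1}
f_tilde = f - np.roll(f, 1)
f_tilde[0] = 0.                    # discard the wrap-around term at t = 0

print(f_tilde)                     # +1 and -1 appear exactly at the two edges
```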
When is convolution a useful sparsity constraint?
-
To obtain $\tilde f$, the same local linear operation
is applied to every $t$ and its $d$ neighbors. Here, it is multiplication
with $w = (-1, 1)$.
-
This is meaningful if the data has dependencies
between degrees of freedom $f_t$ that
appear
independently of the index $t$
and are constrained locally to a distance $d$:
the data features local patterns.
Examples for such local patterns
- Images: edges
- Audio: frequency patterns
- Language: grammar structures
Quick test: is the information content not invariant under a permutation of the indices $t$?
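One way to make this test concrete, as a rough sketch with a made-up signal: randomly permuting the indices $t$ keeps the set of values but destroys the local edge structure that a local kernel could exploit.

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.array([0., 0., 0., 1., 1., 1., 1., 0., 0.])   # ordered: two clean edges
f_perm = rng.permutation(f)                          # same values, order destroyed

count_edges = lambda x: np.count_nonzero(np.diff(x))
print(count_edges(f), count_edges(f_perm))   # 2 for the ordered signal, typically more after permuting
```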
Learn a kernel
Let us initialize a kernel ($d=1$) with random values $w_i$:
$W = \begin{pmatrix} \ddots & \ddots & & \\ & w_2 & w_1 & 0 & \\ & 0 & w_2 & w_1 & \\ & & \ddots & \ddots \end{pmatrix} \quad\Leftrightarrow\quad \tilde f_t = w_1 f_t + w_2 f_{t-1},$
[Figure: again the example input $f$ and its filtered output $\tilde f$, now under the random kernel]
-
The random kernel, too, seems to detect edges very well!
Most of the work is already done! ▷
It seems not too hard to learn meaningful kernels!
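The same toy check for a random kernel (seed and signal are arbitrary): even random values $w_1, w_2$ produce an output that changes abruptly at the edges.

```python
import numpy as np

rng = np.random.default_rng(2)
w1, w2 = rng.normal(size=2)                  # random kernel values
f = np.array([0., 0., 0., 1., 1., 1., 1., 0., 0.])

f_tilde = w1 * f + w2 * np.roll(f, 1)        # f~_t = w1 f_t + w2 f_{t-1}
f_tilde[0] = 0.

print(np.round(f_tilde, 2))                  # the output changes value only around the two edges
```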
Learn several kernels per layer
- Several kernels should learn different
local patterns,
e.g. edges oriented in different directions (sketched below).
-
So: evidently it's meaningful to use the same kernel for each
location t in the input, because patterns appear in the same
way across locations t.
But: What about the relevance of
where patterns appear?
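A rough 2D sketch of this idea (the toy image and the two hand-picked kernels are made up; scipy's convolve2d is used only for convenience): two kernels with the same weights at every location, each responding to a different edge orientation.

```python
import numpy as np
from scipy.signal import convolve2d

# toy image: a bright square on a dark background
img = np.zeros((8, 8))
img[2:6, 2:6] = 1.

# two small kernels for differently oriented edges
k_vert = np.array([[-1., 1.]])       # horizontal difference -> vertical edges
k_horiz = np.array([[-1.], [1.]])    # vertical difference   -> horizontal edges

# the same kernel weights are applied at every location of the image
maps = [convolve2d(img, k, mode='same') for k in (k_vert, k_horiz)]
for m in maps:
    print(np.count_nonzero(m))       # each map responds only along one edge orientation
```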
Pooling layers and local translational invariance
Assumption
-
In most cases, classification information does not depend strongly
on the location (index t) of a pattern.
That is, the presence of a pattern is more
important than its location.
- In many cases, our only
interest is the presence or absence of a pattern.
Example: shifting the input
- Pooling layers implement local translational invariance
- here: max pooling layer
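A minimal max-pooling sketch (the helper max_pool, the window width, and the toy activation pattern are my own, not from the slides): the pooled value at a fixed position does not change when the detected pattern shifts within the pooling window.

```python
import numpy as np

def max_pool(x, width=3):
    """Max over a sliding window of the given width (stride 1, 'valid' positions)."""
    return np.array([x[i:i + width].max() for i in range(len(x) - width + 1)])

# a single detected pattern ("activation") at position 4
f = np.zeros(9)
f[4] = 1.

# shifting the pattern by one position in either direction
for shift in (-1, 0, 1):
    print(shift, max_pool(np.roll(f, shift))[3])   # pooled value at position 3 stays 1.0
```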
Assemble everything
-
Read input f.
-
Convolution stage
$\tilde f^{(k)} := W^{(k)} f,$
where $W^{(k)}$ is one of $K$ convolution kernels,
$k = 1, \ldots, K$.
-
Detector stage
$\tilde f^{(k)}_t := \phi\bigl(\tilde f^{(k)}_t + b\bigr),$
where $\phi$ is an activation function and $b$ a bias.
-
Pooling stage
$\tilde f^{(k)}_t := \max_{\tau \in [t-d,\, t+d]} \tilde f^{(k)}_\tau$
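Putting the three stages together as a minimal 1D numpy sketch (K, d, the random kernels, the ReLU choice for $\phi$, and the bias value are all made up for illustration; a real implementation would of course use a deep-learning framework):

```python
import numpy as np

rng = np.random.default_rng(0)

D, K, d = 16, 4, 1                        # signal length, number of kernels, locality d
f = rng.normal(size=D)                    # input signal
kernels = rng.normal(size=(K, 2 * d))     # K small kernels w^(k)
bias = 0.1
phi = lambda x: np.maximum(x, 0.)         # activation function (here: ReLU)

feature_maps = []
for w in kernels:
    conv = np.convolve(f, w, mode='same')                 # convolution stage
    det = phi(conv + bias)                                # detector stage
    padded = np.pad(det, d, constant_values=-np.inf)      # pad for the pooling window
    pooled = np.array([padded[t:t + 2 * d + 1].max()      # pooling stage:
                       for t in range(D)])                #   max over [t - d, t + d]
    feature_maps.append(pooled)

output = np.stack(feature_maps)           # shape (K, D): K pooled feature maps
print(output.shape)
```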
Some comments
-
The receptive field can grow layer by layer.
-
Downsampling after the pooling layer accounts for the reduced information content.
-
From a Bayesian view, convolutional networks encode our beliefs
about the structure of certain data
(as argued up to here)
using an infinitely strong prior.
More comments/questions
-
Why do we put fully connected layers on top of the
convolutional layers? For example, if our classification label
is translation-invariant, shouldn't there be a smarter way?
-
Traditionally, CNNs have been used for whole-image
classification. Recent work deals with their application to
pixelwise classification (object detection, segmentation, tracking)
and aims at going beyond an independent treatment of patches.
Thank you!