Machine Learning Basics

Pedagogical talk, mainly based on
Murphy, Machine Learning - A Probabilistic Perspective (2012)
Theis, Lecture Notes on Statistical Learning, TU Munich (2016)
Goodfellow, Bengio & Courville, Deep Learning (2016)

Nanotemper Technologies · Munich · 10 June 2016

F. Alexander Wolf

Institute of Computational Biology

Helmholtz Zentrum München

The Future of Robotics and Artificial Intelligence, Andrew Ng, Stanford University, 2011

Machine learning in robotics, natural language processing, neuroscience research, and computer vision.
Jordan & Mitchell, Science 349, 255 (2015)

What is Machine Learning?

It's statistics using models with higher complexity.

These might yield higher precision but are less interpretable.

R. Tibshirani, Lecture Notes, Stanford U (2012)

What does Machine Learning do?

  • Estimate a functional relation $$f : \mathcal{X} \rightarrow \mathcal{Y} \qquad X \mapsto Y$$ from data $\mathcal{D} = \{(x_i,y_i)\}_{i=1}^{N}$ (supervised case).
  • Estimate similarity in the space $\mathcal{X}$ from data $\mathcal{D} = \{x_i\}_{i=1}^{N}$, $x_i \in \mathcal{X}$ (unsupervised case).


  • Estimation based on data is referred to as learning.
  • Also the supervised case requires learning similarity in $\mathcal{X}$.

Classification example

Learn function $$f : \mathbb{R}^{28\times 28} \rightarrow \{2,4\}.$$

Examples from the MNIST database.

In which way are samples, e.g. for the label $y=2$, similar to each other?

▷ Strategy: find coordinates = features that reveal the similarity!

▷ Here PCA: diagonalize the Gram matrix $(x_i^\top x_j)_{i,j=1}^N$ of the centered samples, which yields the same principal components as diagonalizing the covariance matrix.
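A minimal sketch of this strategy, assuming NumPy and random stand-in data rather than real MNIST images (the names `pca_features`, `X` and `Z` are illustrative). It diagonalizes the matrix of inner products $x_i^\top x_j$ of the centered samples and uses the leading eigenvectors as feature coordinates.

```python
import numpy as np

def pca_features(X, n_components=2):
    """Project N samples (rows of X) onto their leading principal components
    by diagonalizing the N x N matrix of inner products of the centered data."""
    Xc = X - X.mean(axis=0)                    # center each coordinate
    gram = Xc @ Xc.T                           # (x_i^T x_j)_{ij}, shape (N, N)
    eigvals, eigvecs = np.linalg.eigh(gram)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    # Score of sample i on component c is sqrt(lambda_c) * u_c[i].
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

rng = np.random.default_rng(0)
# Stand-in for flattened 28 x 28 images: two clusters in 784 dimensions.
X = np.vstack([rng.normal(0.0, 1.0, (20, 784)),
               rng.normal(1.0, 1.0, (20, 784))])
Z = pca_features(X, n_components=2)            # shape (40, 2)
```

In this toy setting the first coordinate `Z[:, 0]` separates the two clusters, which is the sense in which good features reveal similarity.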

A simple model: k Nearest Neighbors

Model function $\hat f$: estimator $\hat y_x$ for $Y$ given $X=x$ $$ \hat y_x = \hat f_\mathcal{D}(x) = \mathrm{E}_{p(y\,|\,x,\mathcal{D})}[y] = \frac{1}{k} \sum_{i \in N_k(x,\mathcal{D})} y_i $$

Hastie et al., Elements of Statistical Learning (2009)

Probabilistic model definition reflects uncertainty $$ p(y\,|\,x,\mathcal{D}) = \frac{1}{k} \sum_{i \in N_k(x,\mathcal{D})} \mathbb{I}(y_i = y) $$

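  • Overfitting and bias-variance tradeoff: small $k$ gives low bias but high variance (overfitting); large $k$ lowers the variance at the price of higher bias.
  • Non-parametric: no parameters are learned; predictions are read off directly from the data $\mathcal{D}$.
  • Assumption: Euclidean distance is a good similarity measure for $x$.
  • Curse of dimensionality: works poorly in high dimensions, where even the nearest neighbors are far away.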
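The kNN estimator and its probabilistic definition can be sketched as follows, assuming NumPy, Euclidean distance and a made-up toy data set with labels 2 and 4 (the function name `knn_predict` is illustrative).

```python
import numpy as np

def knn_predict(x, X_train, y_train, k=3):
    """Return (y_hat, probs) for a query x using Euclidean k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)
    neighbors = np.argsort(dists)[:k]            # index set N_k(x, D)
    y_nb = y_train[neighbors]
    y_hat = y_nb.mean()                          # (1/k) * sum of neighbor labels
    # p(y | x, D): fraction of neighbors carrying each label
    probs = {int(c): float(np.mean(y_nb == c)) for c in np.unique(y_train)}
    return y_hat, probs

# Toy data, not MNIST: three points labeled 2, two points labeled 4.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([2, 2, 2, 4, 4])
y_hat, probs = knn_predict(np.array([0.8, 0.9]), X_train, y_train, k=3)
```

For this query two of the three neighbors carry label 4, so `probs[4]` is 2/3 and `y_hat` is 10/3. There is no training step: the "model" is the stored data itself.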

Another simple model: linear regression

Estimator for $y$ given $x$ $$ \hat y_x = \hat f_{\theta}(x) = \mathrm{E}_{p(y\,|\,x,\theta)}[y] = w_0 + x^\top w $$ Probabilistic model definition $$ p(y\,|\,x,\theta) = \mathcal{N}(y \,|\, \hat y_x,\sigma^2), \quad \theta = (w_0,w,\sigma) $$ Estimate parameters from data $\mathcal{D}$ $$ \theta^* = \text{argmax}_\theta p(\theta\,|\,\mathcal{D}) $$

  • Parametric, parameters $\theta$ have to be learned.
  • High bias due to linearity assumption, but works in high dimensions, and is easily interpretable.
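A minimal sketch of fitting this model, assuming NumPy, synthetic data with made-up true parameters, and a uniform prior so that the posterior maximum coincides with the least-squares solution (the name `fit_linear_regression` is illustrative).

```python
import numpy as np

def fit_linear_regression(X, y):
    """Least-squares estimate of (w0, w) and of the noise scale sigma."""
    A = np.column_stack([np.ones(len(X)), X])    # column of ones carries w0
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    sigma = np.sqrt(np.mean(residuals ** 2))     # maximum-likelihood noise estimate
    return coef[0], coef[1:], sigma

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])              # made-up ground truth
y = 0.7 + X @ w_true + rng.normal(scale=0.1, size=200)
w0, w, sigma = fit_linear_regression(X, y)
```

The fitted `w` can be read directly as the effect of each input coordinate on the prediction, which is what makes the model easy to interpret.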

Learning parameters

Estimate parameters from data $\mathcal{D}$ $$ \theta^* = \text{argmax}_\theta\, p(\theta\,|\,\mathcal{D},\mathrm{model},\mathrm{beliefs}) \qquad \text{(optimization)} $$ assuming a model and prior beliefs about parameters. Now
$$ p(\theta\,|\,\mathcal{D}) = p(\mathcal{D}\,|\,\theta)\,p(\theta)/p(\mathcal{D}). \qquad\quad \text{(Bayes' rule)} $$
Evaluate: assume uniform prior $p(\theta)$ and iid samples $(y_i, x_i)$ $$ p(\theta\,|\,\mathcal{D}) \propto p(\mathcal{D}\,|\,\theta) = \prod_{i=1}^N p(y_i, x_i \,|\,\theta) \propto \prod_{i=1}^N p(y_i \,|\, x_i, \theta) $$

Linear regression: $ \log p(\theta\,|\,\mathcal{D}) = -\frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - \hat f_\theta(x_i))^2 - N \log\sigma + \text{const}$   ▷ maximizing over $(w_0, w)$ is least squares!
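
Spelled out under the assumptions above (uniform prior, iid samples, Gaussian noise with variance $\sigma^2$), the step from the posterior to least squares reads
$$ \log p(\theta\,|\,\mathcal{D}) = \sum_{i=1}^N \log \mathcal{N}(y_i \,|\, w_0 + x_i^\top w, \sigma^2) + \text{const} = \sum_{i=1}^N \left[ -\frac{(y_i - w_0 - x_i^\top w)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2) \right] + \text{const}, $$
so that, for any fixed $\sigma$,
$$ \text{argmax}_{w_0,w}\, \log p(\theta\,|\,\mathcal{D}) = \text{argmin}_{w_0,w} \sum_{i=1}^N (y_i - w_0 - x_i^\top w)^2 . $$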

Learning parameters: robot example

Example based on S. Thrun, Statistics, Udacity (2012)

One example: Deep Learning

Deep Learning: Neural Network Model


A Neural Network consists of layered linear regressions (one for each neuron) stacked with non-linear activation functions $\phi$.

\begin{align} p(y\,|\,x,\theta) & = \mathcal{N}(y \,|\, v^\top z(x), \sigma^2)\\ z(x) & = (\phi(w_1^\top x), \ldots, \phi(w_H^\top x)) \end{align}
  • Deep learning means many layers.
  • In each hidden layer, combine the weight vectors $w_i$ into a matrix $\mathbf{W}$.
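
The one-hidden-layer model above can be sketched as a forward pass, assuming NumPy, random weights for illustration only, and $\phi = \tanh$ (the slides do not fix a particular activation; `hidden_layer` and `predict_mean` are illustrative names).

```python
import numpy as np

def hidden_layer(x, W, phi=np.tanh):
    """z(x) = (phi(w_1^T x), ..., phi(w_H^T x)); the rows of W are the w_h."""
    return phi(W @ x)

def predict_mean(x, W, v):
    """Mean v^T z(x) of the Gaussian p(y | x, theta)."""
    return v @ hidden_layer(x, W)

rng = np.random.default_rng(2)
D, H = 4, 3                       # input dimension, number of hidden units
W = rng.normal(size=(H, D))       # random weights, not trained
v = rng.normal(size=H)
x = rng.normal(size=D)
y_mean = predict_mean(x, W, v)
```

Stacking several such `hidden_layer` calls, each with its own matrix, gives a deep network.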

Deep Learning: Idea

Hubel and Wiesel (1959, 1962, 1968): Nobel prize 1981 for work on the mammalian visual system.   ▷ Results on primary visual cortex (V1).

  • V1 is arranged in a spatial map mirroring the structure of the image in the retina.
  • V1 has simple cells whose activity is a linear function of the image in a small localized receptive field.
  • V1 has complex cells whose activity is invariant to small spatial translations.
  • Neurons in V1 respond most strongly to very specific, simple patterns of light, such as oriented bars, but hardly respond to any other patterns.

Deep Learning: Convolution Layer

  • Discrete convolution of functions $f_t$ and $w_t$, $t\in\{1,2,...,D\}$,
    $$ \tilde f_t = \sum_\tau w_{t-\tau}\, f_\tau = (\mathbf{W} \mathbf{f})_t, \quad \mathbf{\tilde f},\mathbf{f} \in \mathbb{R}^D $$ where $W_{t\tau} = w_{t-\tau}$, $\mathbf{W} \in \mathbb{R}^{D\times D}$.

▷  Instead of $D^2$, only $\mathcal{O}(D)$ independent components (the $2D-1$ values $w_{-(D-1)},\ldots,w_{D-1}$).

Natural extension: sparsity

  • demand: $w_{t-\tau} \stackrel{!}{=} 0$ for $|t-\tau| > d$
    [usual property of kernels: e.g. Gaussian $ W_{t\tau} = e^{-\frac{(t-\tau)^2}{2d^2}}$]

▷  Instead of $D^2$, at most $2d+1$ nonzero components.  ▷  Statistics ☺!
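
A minimal sketch of this construction, assuming NumPy, zero padding at the boundaries, and a Gaussian kernel truncated at offsets $-d,\ldots,d$ (the helper `conv_matrix` is illustrative). It builds $\mathbf{W}$ explicitly and checks that $\mathbf{W}\mathbf{f}$ agrees with a direct convolution.

```python
import numpy as np

def conv_matrix(w, D):
    """Build W with W[t, tau] = w_{t - tau} for |t - tau| <= d, else 0.
    The array w has length 2d+1 and stores the weights for offsets -d..d."""
    d = (len(w) - 1) // 2
    W = np.zeros((D, D))
    for t in range(D):
        for tau in range(max(0, t - d), min(D, t + d + 1)):
            W[t, tau] = w[(t - tau) + d]      # shift offset to a valid array index
    return W

D, d = 8, 2
offsets = np.arange(-d, d + 1)
w = np.exp(-offsets**2 / (2 * d**2))          # truncated Gaussian kernel
W = conv_matrix(w, D)
f = np.arange(D, dtype=float)
f_tilde = W @ f                               # same as np.convolve(f, w, mode="same")
```

Every row of `W` reuses the same few kernel values, so the number of free parameters is set by `d` and not by `D`.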

Deep Learning: Convolution Layer

[Figure, two panels]
  • General weight matrix $\mathbf{W}$ (arrows represent arbitrary values): receptive field of $\tilde f_t$ has full range $D$.
  • Convolution-type $\mathbf{W}$ (arrows: same values across receptive fields): receptive field of $\tilde f_t$ has local range $2d$.

Deep Learning: Why is convolution useful?

Consider an example ($d=1$)

$ \mathbf{W} = \left(\begin{array}{ccccc} \ddots & -1 & 1 & 0 & \ddots\\ \ddots & 0 & -1 & 1 & \ddots \end{array} \right)$ $\,\Leftrightarrow\,$ $\tilde f_t = f_t - f_{t-1}$,

that is, $\mathbf{f} \mapsto \mathbf{\tilde f}$ [figure: input $\mathbf{f}$ and filtered output $\mathbf{\tilde f}$]

▷  Simple edge structures revealed! Just as the simple cells in V1!
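
A one-dimensional illustration of this example, assuming NumPy and a made-up input signal (a bright bar on a dark background); setting the first output entry to zero is a boundary choice made here, not taken from the slides.

```python
import numpy as np

def difference_filter(f):
    """f_tilde[t] = f[t] - f[t-1], with f_tilde[0] set to 0 at the boundary."""
    f = np.asarray(f, dtype=float)
    f_tilde = np.zeros_like(f)
    f_tilde[1:] = f[1:] - f[:-1]
    return f_tilde

f = np.array([0, 0, 0, 1, 1, 1, 0, 0])   # bright bar on a dark background
f_tilde = difference_filter(f)           # [0, 0, 0, 1, 0, 0, -1, 0]
```

The output is nonzero only where the signal changes: $+1$ at the rising edge and $-1$ at the falling edge.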

Deep Learning: Pooling layers


  • In most cases, classification information does not depend strongly on the location (index $t$) of a pattern. That is, the presence of a pattern is more important than its location.
  • In many cases, our only interest is the presence or absence of a pattern.

Max-Pooling Layer

  • Implements local translational invariance.
  • Just as the complex cells in V1.
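
A minimal sketch of max pooling, assuming NumPy, a window $[t-d, t+d]$ clipped at the boundaries, and stride 1 (the name `max_pool` and the toy inputs are illustrative).

```python
import numpy as np

def max_pool(f, d=1):
    """out[t] = max of f over the window [t-d, t+d], clipped at the boundaries."""
    f = np.asarray(f, dtype=float)
    return np.array([f[max(0, t - d): t + d + 1].max() for t in range(len(f))])

a = np.array([0, 0, 1, 0, 0, 0])
b = np.array([0, 0, 0, 1, 0, 0])     # same pattern shifted by one position
pooled_a, pooled_b = max_pool(a), max_pool(b)
# pooled_a is [0, 1, 1, 1, 0, 0], pooled_b is [0, 0, 1, 1, 1, 0]
```

At positions 2 and 3 the pooled outputs agree although the inputs differ there: a shift smaller than the window leaves the local response unchanged.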

Deep Learning: Convolutional Neural Network

  1. Read input $\mathbf{f}$.
  2. Convolution stage
    $\,\mathbf{\tilde f}^{(k)} := \mathbf{W}^{(k)} \mathbf{f},$
    where $\mathbf{W}^{(k)}$ is one of $K$ convolution kernels, $k=1,...,K$.
  3. Detector stage $\, \tilde f_t^{(k)} := \phi(\tilde f_t^{(k)} + b)$ where $\phi$ is an activation function, $b$ a bias.
  4. Pooling stage $\, \tilde f_t^{(k)} := \max_{\tau \in [t-d,t+d]} \tilde f_\tau^{(k)}$
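
The four stages can be sketched for a one-dimensional input as follows, assuming NumPy, ReLU as the activation $\phi$, zero padding, and two hand-picked kernels (all names and values are illustrative; in practice the kernels are learned).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv_layer(f, kernels, b=0.0, d_pool=1, phi=relu):
    """Convolution, detector and pooling stage for each of K kernels.
    Each kernel stores weights for offsets -d..d. Returns shape (K, D)."""
    f = np.asarray(f, dtype=float)                      # 1. read input
    out = []
    for w in kernels:
        conv = np.convolve(f, w, mode="same")           # 2. convolution stage
        det = phi(conv + b)                             # 3. detector stage
        pooled = np.array([det[max(0, t - d_pool): t + d_pool + 1].max()
                           for t in range(len(det))])   # 4. pooling stage
        out.append(pooled)
    return np.stack(out)

f = np.array([0, 0, 0, 1, 1, 1, 0, 0], dtype=float)
kernels = [np.array([0.0, 1.0, -1.0]),     # f[t] - f[t-1]: rising edges
           np.array([0.0, -1.0, 1.0])]     # f[t-1] - f[t]: falling edges
features = conv_layer(f, kernels)
# features[0] is [0, 0, 1, 1, 1, 0, 0, 0], features[1] is [0, 0, 0, 0, 0, 1, 1, 1]
```

Each row of `features` is one feature map: it marks, up to a small shift, where the corresponding pattern occurs in the input.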

Deep Learning: Comments

  • The receptive field grows with depth, and increasingly complex features are constructed in each layer.

  • What can we learn from deep learning in general?
    Convolutional networks are so successful because they efficiently encode our (correct) beliefs about the structure of certain data (translation-invariant, simple local features, complex features from simple features).
    ▷ Understand the similarity structure of your data and use a model that reflects it!


  • Machine Learning is statistics with models of higher complexity.
  • It's all about similarity in the data space.
  • Two simple examples: kNN and linear regression
  • Learning is Bayes' rule followed by an optimization.
  • Deep Learning: very successful way of understanding and exploiting the similarity structure of e.g. image data.

Thank you!