Pedagogic talk, mainly based on
Nanotemper Technologies · Munich · 10 June 2016
F. Alexander Wolf
Institute of Computational Biology
Helmholtz Zentrum München
Machine learning in robotics,
natural language processing, neuroscience research,
and computer vision.
It is statistics using models of higher complexity.
These may yield higher predictive accuracy but are less interpretable.
Comments
Learn function $$f : \mathbb{R}^{28\times 28} \rightarrow \{2,4\}.$$
In which way are samples, e.g. for the label $y=2$, similar to each other?
▷ Strategy: find coordinates = features that reveal the similarity!
▷ Here PCA: diagonalize the covariance matrix, equivalently the Gram matrix $(x_i^\top x_j)_{i,j=1}^N$.
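A minimal numpy sketch of this step, on synthetic stand-in data (the array shapes mirror flattened $28\times 28$ images, but nothing here is the talk's actual implementation): diagonalize the sample covariance matrix and project onto the leading eigenvectors.

```python
import numpy as np

# Illustrative data: N=100 samples of dimension D=784 (28*28 flattened)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 784))

Xc = X - X.mean(axis=0)                  # center the data
C = Xc.T @ Xc / (len(Xc) - 1)            # D x D sample covariance matrix

# Diagonalize: eigenvectors = principal axes, eigenvalues = variances
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
components = eigvecs[:, order[:2]]       # top-2 principal components

Z = Xc @ components                      # project samples to 2D coordinates
print(Z.shape)                           # (100, 2)
```

On real digit images, plotting the two columns of `Z` is what reveals the similarity of samples with the same label.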
Model function $\hat f$: estimator for $Y$ given $X=x$ $$ \hat y_x = \hat f_\mathcal{D}(x) = \mathrm{E}_{p(y\,|\,x,\mathcal{D})}[y] = \frac{1}{k} \sum_{i \in N_k(x,\mathcal{D})} y_i $$ Probabilistic model definition reflects uncertainty $$ p(y\,|\,x,\mathcal{D}) = \frac{1}{k} \sum_{i \in N_k(x,\mathcal{D})} \mathbb{I}(y_i = y) $$
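A minimal sketch of this $k$-NN model on synthetic two-cluster data (data and names are illustrative): $p(y\,|\,x,\mathcal{D})$ is the label histogram over the $k$ nearest neighbours $N_k(x,\mathcal{D})$.

```python
import numpy as np

# Illustrative dataset: two Gaussian clusters, labels 2 and 4
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2, 1, size=(20, 2)),
                    rng.normal(+2, 1, size=(20, 2))])
y = np.array([2] * 20 + [4] * 20)

def knn_posterior(x, X, y, k=5):
    """Return p(y | x, D) as a dict label -> relative frequency."""
    dists = np.linalg.norm(X - x, axis=1)
    nearest = y[np.argsort(dists)[:k]]    # labels of N_k(x, D)
    labels, counts = np.unique(nearest, return_counts=True)
    return dict(zip(labels, counts / k))

# Query a point inside the first cluster: most mass falls on label 2
post = knn_posterior(np.array([-2.0, -2.0]), X, y)
print(post)
```

The estimator $\hat y_x$ above is then just the expectation of this histogram.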
Estimator for $y$ given $x$ $$ \hat y_x = \hat f_{\theta}(x) = \mathrm{E}_{p(y\,|\,x,\theta)}[y] = w_0 + x^\top w $$ Probabilistic model definition $$ p(y\,|\,x,\theta) = \mathcal{N}(y \,|\, \hat y_x,\sigma), \quad \theta = (w_0,w,\sigma) $$ Estimate parameters from data $\mathcal{D}$ $$ \theta^* = \text{argmax}_\theta p(\theta\,|\,\mathcal{D}) $$
Estimate parameters from data $\mathcal{D}$
$$ \theta^* = \mathrm{argmax}_\theta\, p(\theta\,|\,\mathcal{D},\mathrm{model},\mathrm{beliefs}) \qquad \text{(Optimization)}
$$
assuming a model and prior beliefs about parameters.
Now
$$ p(\theta\,|\,\mathcal{D})
= p(\mathcal{D}\,|\,\theta)\,p(\theta)/p(\mathcal{D}). \qquad \text{(Bayes' rule)}
$$
Evaluate: assume uniform prior $p(\theta)$ and iid samples $(y_i, x_i)$
$$
p(\theta\,|\,\mathcal{D})
\propto p(\mathcal{D}\,|\,\theta)
= \prod_{i=1}^N p(y_i, x_i \,|\,\theta)
\propto \prod_{i=1}^N p(y_i \,|\, x_i, \theta)
$$
Linear regression: $ \log p(\theta\,|\,\mathcal{D}) \simeq -\sum_{i=1}^N (y_i - \hat f_\theta(x_i))^2$ ▷ least squares!
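A quick numerical check of this equivalence on synthetic data (true parameters are made up for illustration): maximizing the Gaussian log-posterior with a uniform prior is the same as solving the least-squares problem.

```python
import numpy as np

# Illustrative data from the model y = w0 + w*x + noise, w0=0.5, w=2.0
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=50)
y = 0.5 + 2.0 * x + rng.normal(0, 0.1, size=50)

A = np.column_stack([np.ones_like(x), x])        # design matrix [1, x]
(w0, w), *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares solution
print(w0, w)                                     # close to 0.5 and 2.0
```

The `lstsq` solution is exactly the $\theta^*$ of the argmax above for this model.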
A neural network consists of layers of linear regressions (one per neuron), stacked with non-linear activation functions $\phi$ in between.
▷ Instead of $D^2$, only $D$ independent components.
▷ Instead of $D^2$, only $2d$ nonzero components. ▷ Statistics ☺!
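The layered structure described above can be sketched as follows; the weights here are random placeholders (a two-layer toy, not the talk's architecture), and $\phi$ is taken to be a ReLU as one common choice.

```python
import numpy as np

def phi(z):
    return np.maximum(z, 0.0)            # ReLU activation (one choice of phi)

# Random placeholder weights: layer 1 maps 784 -> 16, layer 2 maps 16 -> 1
rng = np.random.default_rng(3)
W1, b1 = rng.normal(size=(16, 784)), np.zeros(16)
W2, b2 = rng.normal(size=(1, 16)), np.zeros(1)

def f_hat(x):
    h = phi(W1 @ x + b1)                 # each row of W1: one linear regression
    return W2 @ h + b2                   # output: one more linear regression

x = rng.normal(size=784)                 # e.g. a flattened 28x28 image
print(f_hat(x).shape)                    # (1,)
```

Making `W1` a sparse band matrix instead of a dense $D^2$-parameter block is exactly the savings the bullets above describe.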
Consider an example ($d=1$)
$ \mathbf{W} = \left(\begin{array}{ccccc} \ddots & -1 & 1 & 0 & \ddots\\ \ddots & 0 & -1 & 1 & \ddots \end{array} \right)$ $\,\Leftrightarrow\,$ $\tilde f_t = f_t - f_{t-1}$,
that is, a signal $\mathbf{f}$ is mapped to its difference $\mathbf{\tilde f}$, which is nonzero only where $\mathbf{f}$ changes (plots of both signals shown on the slide).
▷ Simple edge structures revealed! Just like the simple cells in V1!
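The $d=1$ example can be checked numerically; below, $\mathbf{W}$ is built with the slide's $(-1, 1)$ band structure and applied to an illustrative step signal (0-based array indexing, so row $t$ computes $f_{t+1} - f_t$).

```python
import numpy as np

f = np.array([0., 0., 0., 1., 1., 1., 0., 0.])   # a step signal

# Band matrix with -1 on the diagonal and 1 on the superdiagonal:
# the difference filter from the slide
T = len(f) - 1
W = np.zeros((T, T + 1))
for t in range(T):
    W[t, t], W[t, t + 1] = -1.0, 1.0

f_tilde = W @ f
print(f_tilde)   # nonzero only where the signal jumps
```

The output responds only at the two jumps of the step, i.e. at its edges.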
Thank you!