**Prof. Dr. Ullrich Köthe, SS 2019**

Machine learning is one of the most promising approaches to address difficult decision and regression problems under uncertainty. The general idea is very simple: Instead of modeling a solution explicitly, a domain expert provides example data that demonstrate the desired behavior on representative problem instances. A suitable machine learning algorithm is then trained on these examples to reproduce the expert's solutions as well as possible and generalize it to new, unseen data. The last two decades have seen tremendous progress towards ever more powerful algorithms. This course will build upon the introductory material of last semester's Fundamentals of Machine Learning course and covers the most important advanced concepts and methods.

The lecture belongs to the Master of Applied Informatics program, but is also recommended for students towards a Master of Physics (specialisation Computational Physics), Master in Scientific Computing and anyone interested.

Solid knowledge in linear algebra, analysis (multi-dimensional differentiation and integration) and probability theory is required. Participants should be familiar with the fundamental concepts from last semester's "Fundamentals of Machine Learning" course or equivalent (for example, you can prepare yourself with Fred Hamprecht's Pattern Recognition Video Lecture).

### Dates:

Lecture | Wednesdays | 11:15-12:45 | COS lecture hall, INF 360 | |

Lecture | Fridays | 11:15-12:45 | Mathematikon lecture hall, INF 205 | |

Tutorials | TBA | |||

Please sign up for the lecture via Muesli (not yet active). |

Homework assignments and other course material will be published on Moodle (not yet active).

### Recommended Textbooks:

**General**

- Christopher M. Bishop:
*"Pattern Recognition and Machine Learning"*, 738 pages, Springer, 2006 - David Barber:
*"Bayesian Reasoning and Machine Learning"*, 720 pages, Cambridge University Press, 2012 (online version) - Kevin P. Murphy:
*"Machine Learning -- A Probabilistic Perspective"*, 1105 pages, The MIT Press, 2012 - Trevor Hastie, Robert Tibshirani, Jerome Friedman:
*"The Elements of Statistical Learning"*(2nd edition), 745 pages, Springer, 2009 (online version) - Richard O. Duda, Peter E. Hart, David G. Stork:
*"Pattern Classification"*(2nd edition), 680 pages, Wiley, 2000 (online version)

**Special Topics**

- Ian Goodfellow, Yoshua Bengio and Aaron Courville:
*"Deep Learning"*, 801 pages, The MIT Press, 2016 (online version) - Carl Edward Rasmussen and Christopher Williams:
*"Gaussian Processes for Machine Learning"*, 266 pages, The MIT Press, 2006 (online version) - Judea Pearl:
*"Causality"*(2nd edition), 484 pages, Cambridge University Press, 2009 - Daphne Koller, Nir Friedman:
*"Probabilistic Graphical Models"*, 1280 pages, The MIT Press, 2009 (advanced methods)

## Approximate Content:

(subject to yearly change)

**Lessons 1 and 2: Recapitulation**

Textbooks: Duda/Hart/Stork: sections 2.5, 2.6 and chapter 5; Hastie/Tibshirani/Friedman: chapter 4; Bishop: chapter 4- Machine learning basics: classification vs. regression, supervised vs. unsupervised
- Mathematical notation of the lecture
- Types of classifiers: decision rules, discriminative models, generative models
- A linear generative model: linear discriminant analysis (LDA)
- A linear decision rule: perceptron (perceptron loss and gradient descent)
- An improved linear decision rule: linear support vector machines (max-margin decisions, hinge loss, the SVM optimization problem)

**Lessons 3 and 4: Logistic Regression**

Bottou (2012):*"Stochastic Gradient Descent Tricks"*(PDF)

Minka (2003):*"A comparison of numerical optimizers for logistic regression"*(PDF)- A linear discriminative model: logistic regression (LR)
- Logistic loss and the regularized LR optimization problem
- Gradient descent and stochastic gradient descent algorithms for LR
- Iterated reweighted least-squares (IRLS)
- The dual optimization problem of LR and its solution by dual coordinate ascend
- Why is stochastic gradient descent fast on big data?
- Variants of stochastic gradient descent: momentum, mini-batches, learning rate control etc.

- A linear discriminative model: logistic regression (LR)
**Lessons 5 to 11: Neural Networks**

Textbooks: Goodfellow, Bengio, Courville: Part II; Bishop: chapter 5; Hastie/Tibshirani/Friedman: chapter 11; Murphy: section 16.5; Duda/Hart/Stork: chapter 6

Nielsen (2014):*"Neural Networks and Deep Learning"*(great online book)*"Theano Tutorials"*(great introduction to neural networks using the Theano software framework)

He et al. (2015):*"Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification"*(PReLU activation function, PDF)

Clevert et al. (2016):*"Fast and Accurate Deep Network Learning by Exponential Linear Units"*(ELU activation function, PDF)

Ronneberger et al. (2015):*"U-Net: Convolutional Networks for Biomedical Image Segmentation"*(PDF)

Ruder (2016):*"An overview of gradient descent optimization algorithms"*(online version)

Wilson and Martinez (2003):*"The general inefficiency of batch training for gradient descent learning"*(PDF)

Ioffe and Szegedy (2015):*"Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift"*(PDF)

Srivastava et al. (2014):*"Dropout: A Simple Way to Prevent Neural Networks from Overfitting"*(PDF)

Gao and Zhou (2014):*"Dropout Rademacher Complexity of Deep Neural Networks"*(PDF)- Motivation: non-linearly combine linear models in parallel and in series to overcome the limitations of single linear models
- History: perceptron, backpropagation, unsupervised pre-training, dropout, piecewise linear activation functions
- Definition of single neurons and activation functions
- Multi-layer architecture, hidden neurons, bias neurons, weights, activation functions, notation
- Theoretical results on performance: VC dimension, universal approximation theorem
- Training by Backpropagation:
- Detailed derivation of the backpropagation algorithm
- Loss functions (quadratic, logistic, cross-entropy) and their gradients

- Modern activation functions (ReLU, PReLU, ELU) and their gradients
- Convolutional Neural Networks (CNNs), U-Nets
- Training Tricks:
- Mini-batch stochastic gradient descent, momentum, AdaDelta, ADAM
- Hyper-parameter selection, termination criterion
- Weight initialization, weight decay
- Batch normalization
- Dropout and the resulting reduction in Rademacher complexity
- Data augmentation

**Lessons 12 to 14: Gaussian Processes**

Textbooks: Rasmussen and Williams; Barber: chapter 19; Bishop: chapter 6 (specifically, section 6.4); Murphy: chapter 15

Unser (1999):*"Splines: A Perfect Fit for Signal and Image Processing"*(PDF1, PDF2)

Snoek et al. (2012):*"Practical Bayesian Optimization of Machine Learning Algorithms"*(PDF)- Approaches to learning when the iid assumption is violated:
- Approximately recover the iid assumption in an augmented feature space (define new features by filters over the neighborhood of each instance)
- Graphical models (explicitly model more compledx independence assumptions => next chapter)
- Gaussian processes (learn the full joint probability as a multivariate Gaussian)

- Definition of Gaussian processes (Gaussian distribution over a function space, prior assumption modeling via a kernel Hilbert space norm, finite multi-dimensional marginals)
- Derivation of the fundamental interpolation equation and its variance via the conditional distribution of new data, given training data
- Modeling uncertainty in the training data, Gaussian process regression, relation of GPs to kernel ridge regression
- Example: linear interpolation
- Kernel functions:
- General rules: Mercer's condition, constructing new kernels from existing ones
- Radial basis functions for scattered data (squared exponential, gamma-exponential, Matern, inverse quadrics, thin-plate splines, Wendland splines)
- Tensor product kernels for data on a grid (squared exponential, B-splines, cardinal functions), implementation of grid-based GP interpolation by filtering

- Application to hyper-parameter optimization (Snoek et al. 2012)
- Application to classification (mapping a latent function through a sigmoid, optimization via the Laplace approximation and Newton's method)

- Approaches to learning when the iid assumption is violated:
**Lessons 15 and 16: Bayesian Network Introduction**

Textbooks: Pearl: chapter 1; Barber: section 3.3; Bishop: sections 8.1 amd 8.2; Murphy: chapter 10; Koller and Friedman: chapter 3

Peters (2015):*"Causality - Lecture Notes"*(PDF)- Pitfalls of data dependencies: Alice's boys and Simpson's paradox (Berkeley admission example)
- Decompositions of joint probabilities according to Bayes rule or in terms of Gibbs distributions
- Graphical models, fundamental tasks in graphical models (marginalization, MAP solution, decision making)
- Directed graphical models, Bayesian networks
- Pearl's basic network construction algorithm, decomposition of a joint probability into parent-child factors
- Fundamental configurations (causal chain, common cause, common effect) and their independence properties
- Bergson's paradox ("explaining away" effect), burglary alarm example
- Markov properties, d-separation, faithfulness
- Efficient algorithm to check for d-separation in a given DAG

**Lessons 17 and 18: Inference in Bayesian Networks**

Textbooks: Barber: sections 5.1 and 5.2; Bishop: section 8.4; Murphy: chapter 20; Koller and Friedman: chapters 8 to 11

Kschischang, Brendan, Loeliger (2001):*"Factor Graphs and the Sum-Product Algorithm"*(PDF)

Aji and McEliece (2000):*"The generalized distributive law"*(PDF)- Basic inference algorithm to compute marginals: variable elimination
- Principles and examples (burglary alarm, the naive Bayes classifier)
- Exponential complexity of variable elimination, complexity reduction by means of the distributive law
- Example: chain graphical model, reduction of inference to matrix exponentiation
- Belief propagation: factor graphs, message passing, sum-product algorithm
- Alice's boys reconsidered

**Lessons 19 to 21: Temporal Bayesian Networks**

Textbooks: Barber: sections 11.2 (EM algorithm) 23.2 and 23.3 (HMMs); Bishop: section 13.2; Murphy: chapter 17 (HMMs) and section 11.4 (EM algorithm)

Welch (2003):*"Hidden Markov Models and the Baum–Welch Algorithm"*(PDF)- Markov chains, probabilistic finite state machines
- Page rank algorithm, the Google matrix
- Hidden Markov models
- Belief propagation in HMMs, the forward-backward algorithm
- Algebraic requirements of belief propagation, sum-product semiring, min-sum semiring, min-sum algorithm
- Finding the MAP configuration of an HMM using the min-sum algorithm, the Viterbi algorithm
- Showing the equivalence between belief propagation and explicit recursion: forward filtering, predictor-corrector method
- Supervised and unsupervised learning of an HMM
- Baum-Welch algorithm, expectation maximization
- Initialization of the Baum-Welch algorithm: random initialisation vs. Viterbi counting
- The relationship between observations, marginals and MAP solutions in HMMs

**Lessons 22 to 27: Causality Analysis with Bayesian Networks**

Textbooks: Pearl: chapters 2 and 3; Barber: section 3.4; Murphy: sections 26.4 to 26.6; Koller and Friedman: chapters 14 to 16

Peters (2015):*"Causality - Lecture Notes"*(PDF)

Le et al. (2015):*"A fast PC algorithm for high dimensional causal discovery with multi-core PCs"*(PDF)

Eberhardt, Glymour, Scheines (2005):*"On the Number of Experiments Sufficient and in the Worst Case Necessary to Identify All Causal Relations Among N Variables"*(PDF)

Claassen, Mooij, Heskes (2013):*"Learning Sparse Causal Models is not NP-hard"*(PDF)

Gretton et al. (2007):*"A kernel statistical test of independence"*(PDF) and newer papers on this topic

Chickering (2002):*"Optimal Structure Identification With Greedy Search"*- score-based approximate BN construction (PDF)

Peters, Mooij, Janzing, Schölkopf (2014):*"Causal Discovery with Continuous Additive Noise Models"*- great tutorial on causality, RESIT algorithm (PDF)

Greenland and Robins (1988):*"Identifiability, exchangeability, and epidemiological confounding"*(PDF)

Rosenbaum and Rubin (1983):*"The Central Role of the Propensity Score in Observational Studies for Causal Effects"*(PDF)

Austin (2011):*"An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies"*(PDF)

Daume et al. (2007, 2010):*"Frustratingly Easy Domain Adaptation"*(PDF) and*"Frustratingly easy semi-supervised domain adaptation"*(PDF)

Bottou et al. (2013):*"Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising"*(PDF)

Barenboim and Pearl (2013):*"A general algorithm for deciding transportability of experimental results"*(PDF)- Introduction: fundamental understanding vs. statistical experiments vs. passive observation
- Example: Cholera epidemics in London ~1850
- Fundamental problem of causal inference: impossibility to play-back time and observe alternative outcomes
- Role of Bayesian networks in causal analysis:
- Basic tasks: prediction, counterfactual analysis, decision making
- Reichenbach's common cause principle
- Modeling interventions in a Bayesian network, Pearl's 'do' operator
- Distinction between conditional and interventional distributions, definition of the total causal effect
- Markov equivalence classes, skeleton and moral graph, Markov minimality
- True causal graphs: reproduction of both observational and interventional distributions, the minimal true causal graph
- How to identify causality from data?
- The idealized IC algorithm
- Optimization of the IC algorithm: the PC and the parallel stable PC algorithms
- Examples for BN construction
- The minimum number of experiments needed to completely identify a BN

- Conditional independence tests: the G-test and its shortcomings, kernel-based independence tests
- Structured equation models
- Approximation algorithms for BN construction:
- Move-making (score-based) algorithms
- The RESIT algorithm: causality detection by testing independence between predictors and
*residuals*

- Parameter estimation of a BN
- Avoiding omitted variable bias in causal analysis:
- Missing mediators in direct effect analysis, example: Berkeley admission
- Missing confounders in total effect analysis, example: kidney stone treatments
- Potential outcomes and exchangeability of treatment groups, the bias of naive conditional expectations
- Randomized experiments as gold standard
- Valid adjustment sets and backdoor adjustment
- Stratification
- Propensity scores, weight adjustment, propensity score matching

- Transfer learning:
- How to combine unsatisfactory data in the target domain with good data from a related domain?
- "Frustatingly easy domain adaptation" (EasyAdapt, EasyAdapt++)
- Data augmentation (adaptation of the training set to the new domain)
- Importance sampling for counterfactual queries, example: computational advertising
- Causal theory of transportability (adjustment formula for selection bias and transfer bias)