Deep Learning & Artificial Intelligence

(1)

WS 2018/2019

(2)

Linear Regression

(3)

Model

(4)

Model

(5)

Error Function: Squared Error

Prediction

Ground truth / target

The factor 1/2 has no special meaning except that it makes the gradients look nicer
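For reference, the squared error for a single example can be sketched as follows. The symbols are assumptions, since the slide's own notation is not preserved here: \hat{y} stands for the prediction and y for the ground truth / target.

```latex
E = \frac{1}{2}\,(\hat{y} - y)^2,
\qquad
\frac{\partial E}{\partial \hat{y}} = \hat{y} - y
```

The factor 1/2 cancels the 2 that appears when the square is differentiated.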

(6)

Objective Function

… with a single example

… with a set of examples

(7)

Objective Function

Solution

(8)

Closed Form Solution

(9)

Closed Form Solution

● Fast to compute

● Only exists for some models and error functions

● Must be determined manually
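A minimal sketch of a closed form solution for linear regression with squared error, assuming the standard normal equations. The function name `fit_closed_form` and the synthetic data are hypothetical and not taken from the slides.

```python
import numpy as np

def fit_closed_form(X, y):
    """Least-squares weights from the normal equations (X^T X) w = X^T y."""
    # Solving the linear system avoids forming an explicit matrix inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical noise-free data: y = 2*x1 - 3*x2 + 0.5
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 2))
X = np.hstack([features, np.ones((100, 1))])  # last column models the bias
w_true = np.array([2.0, -3.0, 0.5])
y = X @ w_true

w_hat = fit_closed_form(X, y)
print(w_hat)
```

One linear solve replaces the whole iterative training loop, which is why the closed form is fast when it exists.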

(10)

Gradient Descent

(11)

Gradient Descent

1. Initialize the parameters at random

2. Compute the error

3. Compute the gradients w.r.t. the parameters

4. Apply the above update rule

5. Go back to 2. and repeat until the error no longer decreases
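The five steps can be sketched for linear regression as follows. This is an illustration under assumptions: the mean squared error with factor 1/2 as the objective, a fixed learning rate `lr`, and a hypothetical tolerance `tol` to decide when the error has stopped decreasing.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, tol=1e-12, max_iters=10_000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    w = rng.normal(scale=0.1, size=X.shape[1])       # 1. initialize at random
    error = 0.5 * np.mean((X @ w - y) ** 2)          # 2. compute error
    new_error = error
    for _ in range(max_iters):
        grad = X.T @ (X @ w - y) / n                 # 3. gradients w.r.t. parameters
        w = w - lr * grad                            # 4. apply the update rule
        new_error = 0.5 * np.mean((X @ w - y) ** 2)  # 5. back to step 2
        if error - new_error < tol:                  #    stop once the error stalls
            break
        error = new_error
    return w, new_error

# Hypothetical noise-free data: y = 2*x1 - 3*x2 + 0.5
rng = np.random.default_rng(1)
X = np.hstack([rng.normal(size=(100, 2)), np.ones((100, 1))])
w_true = np.array([2.0, -3.0, 0.5])
y = X @ w_true

w_gd, final_error = gradient_descent(X, y)
print(w_gd, final_error)
```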

(12)

Computing Gradients

(13)

Computing Gradients

Kronecker delta

(14)

Computing Gradients

(15)

Computing Gradients

(16)

Gradient Descent (Result)

1. Initialize the parameters at random

2. Compute the error

3. Compute the gradients w.r.t. the parameters

4. Apply the above update rule

5. Go back to 2. and repeat until the error no longer decreases

(17)

Probabilistic Interpretation

Error term that captures unmodeled effects or random noise

(18)

(19)

(20)

Likelihood

(21)

Maximum Likelihood

(22)

Log-Likelihood

(23)

Maximum Log-Likelihood
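A sketch of why maximizing the log-likelihood leads back to the squared error. It assumes the usual setup, which may differ in notation from the slides: y_i = w^\top x_i + \epsilon_i with independent Gaussian error terms \epsilon_i \sim \mathcal{N}(0, \sigma^2).

```latex
\begin{aligned}
p(y_i \mid x_i; w) &= \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left(-\frac{(y_i - w^\top x_i)^2}{2\sigma^2}\right) \\
\ell(w) = \log \prod_{i=1}^{N} p(y_i \mid x_i; w)
  &= N \log \frac{1}{\sqrt{2\pi}\,\sigma}
   - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{N} (y_i - w^\top x_i)^2 \\
\arg\max_w \ell(w) &= \arg\min_w \frac{1}{2} \sum_{i=1}^{N} (y_i - w^\top x_i)^2
\end{aligned}
```

The first term and the factor 1/\sigma^2 do not depend on w, so under these assumptions the maximum likelihood solution coincides with the least-squares solution.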

(24)

Neural Networks & Backpropagation

(25)

Error Function

Prediction

Ground truth / target

(26)

Simple Fully-Connected Neural Network

(27)

Objective Function

… with a single example

… with a set of examples

(28)

Gradients: Towards Backpropagation

(29)

Number of neurons of the layer (excluding bias “1”)

(30)

(31)

Can you do this on your own?

(32)

(33)

(34)

(35)

(36)

(37)

Can you do this on your own?

(38)

Backpropagation

“Delta messages”
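A minimal sketch of backpropagation with delta messages, checked against numerical gradients. The network is an assumption chosen for brevity (one tanh hidden layer, linear output, squared error), and all names are hypothetical.

```python
import numpy as np

def forward(params, x):
    W1, b1, W2, b2 = params
    a1 = np.tanh(W1 @ x + b1)      # hidden layer
    y = W2 @ a1 + b2               # linear output layer
    return a1, y

def loss(params, x, t):
    _, y = forward(params, x)
    return 0.5 * np.sum((y - t) ** 2)

def backprop(params, x, t):
    W1, b1, W2, b2 = params
    a1, y = forward(params, x)
    delta2 = y - t                               # delta of the output layer
    delta1 = (W2.T @ delta2) * (1.0 - a1 ** 2)   # delta sent back through tanh
    # Gradients in the same order as params: W1, b1, W2, b2
    return [np.outer(delta1, x), delta1, np.outer(delta2, a1), delta2]

def numerical_gradients(params, x, t, eps=1e-6):
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            original = p[idx]
            p[idx] = original + eps
            plus = loss(params, x, t)
            p[idx] = original - eps
            minus = loss(params, x, t)
            p[idx] = original                    # restore the parameter
            g[idx] = (plus - minus) / (2.0 * eps)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)
params = [rng.normal(size=(4, 3)), rng.normal(size=4),
          rng.normal(size=(2, 4)), rng.normal(size=2)]
x, t = rng.normal(size=3), rng.normal(size=2)
analytic = backprop(params, x, t)
numeric = numerical_gradients(params, x, t)
max_diff = max(np.max(np.abs(a - n)) for a, n in zip(analytic, numeric))
print(max_diff)
```

Each delta is computed from the delta of the layer after it, so the whole backward pass reuses intermediate results instead of re-deriving every gradient from scratch.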

(39)

Activation Functions & Vanishing Gradients

(40)

Common Activation Functions

(41)

Common Activation Functions

Small or even tiny gradient

(42)

Vanishing Gradients

Element-wise multiplication with small or even tiny gradients for each layer

In a neural network with many layers, the gradients of the objective function w.r.t. the weights of a layer close to the inputs may become near zero!

⇒ Gradient descent updates will starve
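The effect can be sketched numerically. The setup is an assumption for illustration: 10 sigmoid layers of width 50, zero-mean Gaussian weights with variance 1/width, and a vector of ones standing in for the error signal at the output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backward_delta_norms(n_layers=10, width=50, seed=0):
    rng = np.random.default_rng(seed)
    weights = [rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width))
               for _ in range(n_layers)]
    # Forward pass, remembering the pre-activations of every layer.
    a = rng.normal(size=width)
    pre_activations = []
    for W in weights:
        z = W @ a
        pre_activations.append(z)
        a = sigmoid(z)
    # Backward pass: multiply element-wise with sigmoid'(z) at each layer.
    delta = np.ones(width)  # stand-in for the error signal at the output
    norms = []
    for W, z in zip(reversed(weights), reversed(pre_activations)):
        s = sigmoid(z)
        delta = delta * s * (1.0 - s)   # sigmoid'(z) is at most 0.25
        norms.append(np.linalg.norm(delta))
        delta = W.T @ delta             # pass the message to the layer below
    return norms[::-1]                  # index 0 = layer closest to the inputs

norms = backward_delta_norms()
print(norms[0], norms[-1])
```

In this sketch the delta norm at the layer closest to the inputs is orders of magnitude smaller than at the last layer.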

(43)

Weight Initialization

(44)

The Importance of Weight Initialization

● Simple CNN trained on MNIST for 12 epochs

● 10-batch rolling average of training loss

Image Source: https://intoli.com/blog/neural-network-initialization/

(45)

The Importance of Weight Initialization

Initialization with “0” values is ALWAYS WRONG!

How to initialize properly?

0 here = everything is 0 = no error signal
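A small sketch of the problem, assuming a network with one tanh hidden layer, a linear output, squared error, and no biases. The function name is hypothetical.

```python
import numpy as np

def weight_gradients(W1, W2, x, t):
    """Gradients of 0.5 * ||y - t||^2 for y = W2 @ tanh(W1 @ x), no biases."""
    a1 = np.tanh(W1 @ x)
    y = W2 @ a1
    delta2 = y - t
    delta1 = (W2.T @ delta2) * (1.0 - a1 ** 2)
    return np.outer(delta1, x), np.outer(delta2, a1)

x = np.array([1.0, -2.0, 0.5])
t = np.array([1.0])
dW1, dW2 = weight_gradients(np.zeros((4, 3)), np.zeros((1, 4)), x, t)
print(dW1, dW2)
```

With all-zero weights the hidden activations are zero and the backward message through W2 is zero, so both weight gradients vanish although the prediction is wrong.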

(46)

Information Flow in a Neural Network

Consider a network with ...

● 5 hidden layers and 100 neurons per hidden layer

● the hidden layer activation function = identity function

Let’s omit the bias term for simplicity (commonly initialized with all 0’s).

(47)

Image Source: https://intoli.com/blog/neural-network-initialization/

(48)

What’s the explanation for the previous image?

One layer with some activation function and without the bias term:

(49)

(50)

(51)

(1) tends to 0 when either (2) tends to 0 or (3) tends to 0.

⇒ Preserve variance of activations throughout the network.

(1) (2) (3)

(52)

Variance approximation possible when pre-activation neurons are close to zero.

(53)

Variance

Basic properties of variance for independent random variables with expected value = 0

(54)

Variance of Activations

Random variables

(55)

(56)

Variance preservation
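A sketch of the forward variance argument. The notation is assumed, not taken from the slides: a pre-activation z_j = \sum_i w_{ji} a_i with n_in inputs, where weights and activations are independent, identically distributed within their group, and have expected value 0.

```latex
\begin{aligned}
\operatorname{Var}(X + Y) &= \operatorname{Var}(X) + \operatorname{Var}(Y),
\qquad
\operatorname{Var}(XY) = \operatorname{Var}(X)\,\operatorname{Var}(Y) \\
\operatorname{Var}(z_j) &= \operatorname{Var}\!\Big(\sum_{i=1}^{n_\text{in}} w_{ji}\, a_i\Big)
  = n_\text{in}\,\operatorname{Var}(w)\,\operatorname{Var}(a) \\
\operatorname{Var}(z_j) = \operatorname{Var}(a)
  &\iff \operatorname{Var}(w) = \frac{1}{n_\text{in}}
\end{aligned}
```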

(57)

Variance of Error Contribution

(58)

(59)

assumption

(60)

Random variables

(61)

(62)

Variance preservation

(63)

“Glorot” Initialization

Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th international conference on artificial intelligence and statistics (pp. 249-256).
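A sketch of the uniform initialization proposed by Glorot and Bengio, compared with a small fixed-scale Gaussian initialization. The comparison setup is an assumption: 5 layers of 100 neurons with identity activation and no bias. The function names and the standard deviation 0.01 are hypothetical.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    # Uniform on [-limit, limit] has variance limit^2 / 3 = 2 / (fan_in + fan_out).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def small_normal(fan_in, fan_out, rng, std=0.01):
    return rng.normal(0.0, std, size=(fan_out, fan_in))

def output_variance(init, n_layers=5, width=100, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(width, n_samples))   # unit-variance inputs
    for _ in range(n_layers):
        a = init(width, width, rng) @ a       # identity activation, no bias
    return a.var()

var_glorot = output_variance(glorot_uniform)
var_small = output_variance(small_normal)
print(var_glorot, var_small)
```

In this setup the Glorot-initialized network keeps the activation variance close to the input variance, while the small Gaussian initialization shrinks it by roughly a factor of 100 per layer.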

(64)

Optimization Methods

(65)

Gradient Descent

Martens, J. (2010). Deep Learning via Hessian-Free Optimization. In Proceedings of the 27th International Conference on Machine Learning (pp. 735-742).

Too large learning rate ⇒ zig-zag

Too small learning rate ⇒ starvation

(66)

Batch Gradient Descent

● Update based on the entire training data set

● Susceptible to converging to local minima

● Expensive and inefficient for large training data sets

(67)

Stochastic Gradient Descent (SGD)

● Update based on a single example

● More robust against local minima

● Noisy updates ⇒ small learning rate

(68)

Mini-Batch Gradient Descent

● Update based on multiple examples

● More robust against local minima

● More stable than stochastic gradient descent

● Most common

● Often also called SGD despite multiple examples
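The three variants differ only in how many examples enter each update, which the following sketch makes explicit through `batch_size`. The linear regression objective, the hyperparameter values, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=20, lr=0.1, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.normal(scale=0.1, size=d)
    for _ in range(epochs):
        order = rng.permutation(n)                  # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = X[idx].T @ residual / len(idx)   # gradient on this batch only
            w -= lr * grad
    return w

# Hypothetical data: y = 2*x1 - 3*x2 + 0.5 plus a little noise
rng = np.random.default_rng(1)
X = np.hstack([rng.normal(size=(200, 2)), np.ones((200, 1))])
w_true = np.array([2.0, -3.0, 0.5])
y = X @ w_true + rng.normal(scale=0.01, size=200)

w_minibatch = minibatch_sgd(X, y, batch_size=20)
w_batch = minibatch_sgd(X, y, batch_size=len(y))    # batch gradient descent
print(w_minibatch, w_batch)
```

Setting `batch_size=1` gives stochastic gradient descent in the strict sense; as noted above, it typically needs a smaller learning rate because its updates are noisier.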

(69)

Gradient Descent with Momentum

● Momentum dampens oscillations

● Gradient is computed before momentum is applied

● Typical momentum term:

(70)

Gradient Descent with Nesterov Momentum

● Gradient is computed after momentum is applied

● Anticipated update from momentum is used to include knowledge of momentum in the gradient

● Typically preferred over vanilla momentum
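The difference between the two momentum variants is only where the gradient is evaluated, as the following sketch shows. The quadratic objective, the learning rate 0.02, and the momentum term 0.9 are assumptions for illustration, not values from the slides.

```python
import numpy as np

def grad(w):
    """Gradient of the hypothetical objective 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return np.array([1.0, 25.0]) * w

def momentum_descent(w0, lr=0.02, mu=0.9, steps=300, nesterov=False):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        # Vanilla momentum: gradient at the current position.
        # Nesterov: gradient at the position anticipated from the momentum.
        lookahead = w + mu * v if nesterov else w
        v = mu * v - lr * grad(lookahead)
        w = w + v
    return w

w_classic = momentum_descent([5.0, 5.0])
w_nesterov = momentum_descent([5.0, 5.0], nesterov=True)
print(w_classic, w_nesterov)
```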

(71)

AdaGrad

● Adaptive (per-weight) learning rates

● Learning rates of frequently occurring features are reduced while learning rates of infrequent features remain large

● Monotonically decreasing learning rates

● Suited for sparse data

● Typical learning rate:
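A minimal sketch of the AdaGrad update, assuming the common formulation that divides the step by the root of the accumulated squared gradients. The learning rate 1.0, `eps`, and the toy gradient are hypothetical choices for this example, not values from the slides.

```python
import numpy as np

def adagrad(grad_fn, w0, lr=1.0, eps=1e-8, steps=200):
    w = np.array(w0, dtype=float)
    accum = np.zeros_like(w)                  # per-weight sum of squared gradients
    effective_lrs = []
    for _ in range(steps):
        g = grad_fn(w)
        accum += g ** 2
        step_size = lr / (np.sqrt(accum) + eps)
        effective_lrs.append(step_size.copy())
        w -= step_size * g
    return w, np.array(effective_lrs)

# Toy objective 0.5 * ||w||^2, whose gradient is w itself.
w_final, lrs = adagrad(lambda w: w, [5.0, -3.0])
print(w_final, lrs[0], lrs[-1])
```

Because `accum` can only grow, each per-weight learning rate can only shrink, which is the monotonic decrease listed above.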

(72)

RMSProp

Typical hyperparameters:
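A minimal sketch of RMSProp, assuming the common formulation that replaces AdaGrad's growing sum with an exponential moving average of squared gradients. The values of `lr`, `decay`, and `eps` are hypothetical choices for this example, not values from the slides.

```python
import numpy as np

def rmsprop(grad_fn, w0, lr=0.01, decay=0.9, eps=1e-8, steps=2000):
    w = np.array(w0, dtype=float)
    avg_sq = np.zeros_like(w)                 # moving average of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        avg_sq = decay * avg_sq + (1.0 - decay) * g ** 2
        w -= lr * g / (np.sqrt(avg_sq) + eps)
    return w

# Toy objective 0.5 * ||w||^2, whose gradient is w itself.
w_final = rmsprop(lambda w: w, [5.0, -3.0])
print(w_final)
```

Since old gradients are forgotten, the per-weight learning rate can grow again, unlike in AdaGrad.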

(73)

Adam

● Often used these days

● Typical hyperparameters:
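A minimal sketch of the Adam update with bias-corrected moment estimates. The values `beta1=0.9`, `beta2=0.999`, and `eps=1e-8` are commonly used defaults and are assumptions here, not necessarily the values on the slide; the learning rate 0.05 is chosen only so that the toy problem converges quickly.

```python
import numpy as np

def adam(grad_fn, w0, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)                      # first moment (momentum-like)
    v = np.zeros_like(w)                      # second moment (RMSProp-like)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** t)        # bias correction for zero init
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy objective 0.5 * ||w||^2, whose gradient is w itself.
w_final = adam(lambda w: w, [5.0, -3.0])
print(w_final)
```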

(74)

Computation Graphs

(75)

Matrix-Vector Multiplication

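[Figure: computation graph. The symbolic variables W (MATRIX, float) and x (VECTOR, float) feed the operation MATMUL, which outputs y (VECTOR, float). Legend: each symbolic variable is labeled with its SYMBOL, TYPE, and data type; each OPERATION is drawn as a separate node.]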

(76)

Indexing

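[Figure: computation graph. A (MATRIX, float) and the index vector i (VECTOR, int) feed the operation INDEXING, which outputs B (MATRIX, float). The illustrated example uses the indices 2, 5, 0.]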

(77)

Graph Optimization

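[Figure: graph optimization. Before: a graph over SCALAR float variables containing a MULTIPLY and a DIVIDE operation with output z. After the OPTIMIZATION step: a smaller graph over SCALAR float variables. Variable names shown: x, y, z.]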

(78)

Automatic Differentiation

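[Figure: automatic differentiation. Forward graph: x (SCALAR, float) feeds SQUARE, which outputs y (SCALAR, float). GRAD(y, x) builds a new graph in which the constant 2 (SCALAR, float) and x feed MULTIPLY, which outputs dy/dx (SCALAR, float).]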
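A toy sketch of symbolic differentiation on a computation graph, in the spirit of GRAD(y, x) on this slide. The classes and the `grad` function are hypothetical and do not correspond to the API of any real framework.

```python
class Var:
    def __init__(self, name):
        self.name = name
    def evaluate(self, env):
        return env[self.name]

class Const:
    def __init__(self, value):
        self.value = value
    def evaluate(self, env):
        return self.value

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right
    def evaluate(self, env):
        return self.left.evaluate(env) + self.right.evaluate(env)

class Multiply:
    def __init__(self, left, right):
        self.left, self.right = left, right
    def evaluate(self, env):
        return self.left.evaluate(env) * self.right.evaluate(env)

class Square:
    def __init__(self, arg):
        self.arg = arg
    def evaluate(self, env):
        return self.arg.evaluate(env) ** 2

def grad(y, x):
    """Build a new graph that computes dy/dx."""
    if isinstance(y, Var):
        return Const(1.0 if y is x else 0.0)
    if isinstance(y, Const):
        return Const(0.0)
    if isinstance(y, Add):
        return Add(grad(y.left, x), grad(y.right, x))
    if isinstance(y, Multiply):   # product rule
        return Add(Multiply(grad(y.left, x), y.right),
                   Multiply(y.left, grad(y.right, x)))
    if isinstance(y, Square):     # chain rule: d(u^2) = 2 * u * du
        return Multiply(Multiply(Const(2.0), y.arg), grad(y.arg, x))
    raise TypeError(f"unsupported node: {type(y).__name__}")

x = Var("x")
y = Square(x)
dy_dx = grad(y, x)
print(dy_dx.evaluate({"x": 3.0}))
```

The derivative graph produced here still contains a multiplication by the constant 1; removing such redundant nodes is the kind of job a graph optimization pass would take on.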

(79)

Neural Network Layers

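[Figure: neural network layer. W (MATRIX, float) and x (VECTOR, float) feed MATMUL; its VECTOR float result and b (VECTOR, float) feed ADD, which outputs a; a feeds TANH, which outputs z (VECTOR, float). The same computation drawn as a single LAYER OP: DENSE maps x (VECTOR, float) to z (VECTOR, float).]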
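The dense layer on this slide, MATMUL followed by ADD and TANH, can be sketched in NumPy as follows. The function name `dense_tanh` and the example values are hypothetical.

```python
import numpy as np

def dense_tanh(W, b, x):
    a = W @ x + b        # MATMUL, then ADD
    return np.tanh(a)    # TANH gives the layer output z

W = np.array([[0.5, -1.0, 0.0],
              [1.0, 1.0, 1.0]])
b = np.array([0.0, -3.0])
x = np.array([2.0, 1.0, 0.0])
z = dense_tanh(W, b, x)
print(z)
```

Wrapping the three operations in one function mirrors the single DENSE layer op that maps x directly to z.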