(1)

Deep Learning & Artificial Intelligence

WS 2018/2019

(2)

Linear Regression

(3)

Model

(4)

Model

(5)

Error Function: Squared Error

E = \frac{1}{2} (\hat{y} - y)^2, with prediction \hat{y} and ground truth / target y

The factor \frac{1}{2} has no special meaning except that it makes the gradients look nicer

(6)

Objective Function

… with a single example

… with a set of examples
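
A common way to write these two objectives, assuming a linear model \hat{y}_n = w^\top x_n (the symbols x_n, y_n, w and N are notational assumptions, not taken from the slides):

E_n(w) = \frac{1}{2} \left( w^\top x_n - y_n \right)^2
\qquad
E(w) = \frac{1}{2} \sum_{n=1}^{N} \left( w^\top x_n - y_n \right)^2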

(7)

Objective Function

Solution

(8)

Closed Form Solution

(9)

Closed Form Solution

Fast to compute

Only exists for some models and error functions

Must be determined manually
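
For linear regression with the squared error, the closed-form solution is the normal-equations result; a sketch, assuming a design matrix X with one example per row and a target vector y (notation not fixed by the slides):

\nabla_w E(w) = X^\top (X w - y) = 0
\quad\Rightarrow\quad
w^\ast = (X^\top X)^{-1} X^\top y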

(10)

Gradient Descent

(11)

Gradient Descent

1. Initialize the parameters at random
2. Compute the error
3. Compute the gradients w.r.t. the parameters
4. Apply the update rule w \leftarrow w - \eta \, \frac{\partial E}{\partial w}
5. Go back to 2. and repeat until the error does not decrease anymore
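
A minimal NumPy sketch of this loop for linear regression with the squared error (the data X, y, the learning rate eta and the stopping tolerance are illustrative assumptions, not values from the slides):

import numpy as np

def gradient_descent(X, y, eta=0.01, tol=1e-8, max_steps=10000):
    """Fit w for the linear model X @ w ~ y with plain gradient descent."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])          # 1. initialize at random
    prev_error = np.inf
    for _ in range(max_steps):
        residual = X @ w - y                  # predictions minus targets
        error = 0.5 * np.sum(residual ** 2)   # 2. compute the squared error
        grad = X.T @ residual                 # 3. gradients w.r.t. the parameters
        w -= eta * grad                       # 4. apply the update rule
        if prev_error - error < tol:          # 5. stop once the error no longer decreases
            break
        prev_error = error
    return w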

(12)

Computing Gradients

(13)

Computing Gradients

Kronecker delta
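
As a reminder of the notation (a standard definition, not specific to these slides), the Kronecker delta is

\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}

so that, for example, \partial w_j / \partial w_i = \delta_{ij}.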

(14)

Computing Gradients

(15)

Computing Gradients

(16)

Gradient Descent (Result)

1. Initialize the parameters at random
2. Compute the error
3. Compute the gradients w.r.t. the parameters
4. Apply the update rule w \leftarrow w - \eta \, \frac{\partial E}{\partial w}
5. Go back to 2. and repeat until the error does not decrease anymore

(17)

Probabilistic Interpretation

Error term that captures unmodeled effects or random noise

(18)

Probabilistic Interpretation

Error term that captures unmodeled effects or random noise

(19)

Probabilistic Interpretation

Error term that captures unmodeled effects or random noise
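
A hedged way to write this model, assuming the linear predictor from before and Gaussian noise (the symbols are illustrative):

y_n = w^\top x_n + \varepsilon_n,
\qquad
\varepsilon_n \sim \mathcal{N}(0, \sigma^2)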

(20)

Likelihood

(21)

Maximum Likelihood

(22)

Log-Likelihood

(23)

Maximum Log-Likelihood
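
Under the Gaussian noise model above, a sketch of the standard argument (notation assumed, not taken from the slides):

\log L(w) = \sum_{n=1}^{N} \log \mathcal{N}\!\left( y_n \mid w^\top x_n, \sigma^2 \right)
= -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \left( y_n - w^\top x_n \right)^2 - \frac{N}{2} \log\!\left( 2\pi\sigma^2 \right)

so maximizing the log-likelihood in w is equivalent to minimizing the squared-error objective.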

(24)

Neural Networks & Backpropagation

(25)

Error Function

Prediction \hat{y} and ground truth / target y (the same squared error as in linear regression, now with the network output as the prediction)

(26)

Simple Fully-Connected Neural Network
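
As a sketch of such a network with one hidden layer (the layer structure, symbols and choice of activation f are assumptions for illustration):

h = f\!\left( W^{(1)} x + b^{(1)} \right),
\qquad
\hat{y} = W^{(2)} h + b^{(2)}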

(27)

Objective Function

… with a single example

… with a set of examples

(28)

Gradients: Towards Backpropagation

(29)

Gradients: Towards Backpropagation

Number of neurons of the layer

(excluding bias “1”)

(30)

Gradients: Towards Backpropagation

(31)

Gradients: Towards Backpropagation

Can you do it on your own?

(32)

Gradients: Towards Backpropagation

(33)

Gradients: Towards Backpropagation

(34)

Gradients: Towards Backpropagation

(35)

Gradients: Towards Backpropagation

(36)

Gradients: Towards Backpropagation

(37)

Gradients: Towards Backpropagation

Can you do it on your own?

(38)

Backpropagation

“Delta messages”
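
A hedged sketch of the delta messages for a fully-connected network with pre-activations z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)} and activations a^{(l)} = f(z^{(l)}) (notation assumed):

\delta^{(L)} = \nabla_{a^{(L)}} E \odot f'\!\left( z^{(L)} \right),
\qquad
\delta^{(l)} = \left( W^{(l+1)} \right)^{\!\top} \delta^{(l+1)} \odot f'\!\left( z^{(l)} \right),
\qquad
\frac{\partial E}{\partial W^{(l)}} = \delta^{(l)} \left( a^{(l-1)} \right)^{\!\top}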

(39)

Activation Functions & Vanishing Gradients

(40)

Common Activation Functions

(41)

Common Activation Functions

Small or even tiny gradient
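
As a concrete illustration with the logistic sigmoid (a standard fact, not taken from the slide):

\sigma(x) = \frac{1}{1 + e^{-x}},
\qquad
\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr) \le \tfrac{1}{4},

and \sigma'(x) is close to 0 wherever |x| is large, i.e. in the saturated regions of the curve.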

(42)

Vanishing Gradients

Element-wise multiplication with small or even tiny gradients for each layer

In a neural network with many layers, the gradients of the objective function w.r.t. the weights of a layer close to the inputs may become near zero!

⇒ Gradient descent updates will starve
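
A rough back-of-the-envelope illustration, assuming each layer contributes a derivative factor of at most 1/4 (the bound for the logistic sigmoid):

\left( \tfrac{1}{4} \right)^{10} = \frac{1}{1048576} \approx 9.5 \cdot 10^{-7},

so after only ten such layers the gradient signal is already reduced by about six orders of magnitude.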

(43)

Weight Initialization

(44)

The Importance of Weight Initialization

Simple CNN trained on MNIST for 12 epochs

10-batch rolling average of training loss

Image Source: https://intoli.com/blog/neural-network-initialization/

(45)

The Importance of Weight Initialization

Initialization with “0” values is ALWAYS WRONG!

0 here = everything is 0 = no error signal

How to initialize properly?

(46)

Information Flow in a Neural Network

Consider a network with ...

5 hidden layers and 100 neurons per hidden layer

the hidden layer activation function = identity function

Let’s omit the bias term for simplicity (it is commonly initialized with all 0’s).
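
A minimal NumPy sketch of this setup (the input distribution and the particular weight scales tried are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
n_layers, n_units = 5, 100
x = rng.normal(size=(n_units,))                    # a random input vector

for init_std in (0.01, 1.0, 1.0 / np.sqrt(n_units)):
    a = x
    for _ in range(n_layers):
        W = rng.normal(scale=init_std, size=(n_units, n_units))
        a = W @ a                                  # identity activation, no bias
    print(f"init std {init_std:.3f} -> std of last hidden layer: {a.std():.3e}")

With a weight scale that is too small the activations shrink towards 0 layer by layer, with one that is too large they blow up, and only a scale around 1/sqrt(n_units) keeps their variance roughly constant.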

(47)

Information Flow in a Neural Network

Image Source: https://intoli.com/blog/neural-network-initialization/

(48)

Information Flow in a Neural Network

What’s the explanation for the previous image?

One layer with some activation function and without the bias term:
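
A hedged way to write that single layer (symbols assumed): a^{(l)} = f\!\left( W^{(l)} a^{(l-1)} \right), i.e. the output of layer l is the activation function applied to the weighted sum of the previous layer’s outputs.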

(49)

Information Flow in a Neural Network

(50)

Information Flow in a Neural Network

(51)

Information Flow in a Neural Network

The gradient w.r.t. the weights of a layer close to the inputs (1) tends to 0 when either the activations (2) tend to 0 or the backpropagated error terms (3) tend to 0.

⇒ Preserve the variance of the activations throughout the network.

(52)

Information Flow in a Neural Network

A variance approximation is possible when the pre-activation values are close to zero, i.e. where the activation function is approximately linear.

(53)

Variance

Basic properties of variance for independent random variables with expected value = 0
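
The properties in question are presumably the standard ones; for independent random variables X, Y with expected value 0 and a constant a:

\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y),
\qquad
\operatorname{Var}(aX) = a^2 \operatorname{Var}(X),
\qquad
\operatorname{Var}(XY) = \operatorname{Var}(X)\,\operatorname{Var}(Y)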

(54)

Variance of Activations

Random variables

(55)

Variance of Activations

(56)

Variance of Activations

Variance preservation

(57)

Variance of Error Contribution

(58)

Variance of Error Contribution

(59)

Variance of Error Contribution

assumption

(60)

Variance of Error Contribution

Random variables

(61)

Variance of Error Contribution

(62)

Variance of Error Contribution

Variance preservation

(63)

“Glorot” Initialization

Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th international conference on artificial intelligence and statistics (pp. 249-256).
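
The resulting initialization from the cited paper: preserving the variance of the activations in the forward pass requires \operatorname{Var}(W) = 1/n_{\text{in}}, preserving the variance of the backpropagated errors requires \operatorname{Var}(W) = 1/n_{\text{out}}, and Glorot & Bengio propose the compromise

\operatorname{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}},
\qquad
\text{e.g.}\quad
W_{ij} \sim \mathcal{U}\!\left[ -\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},\; \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}} \right]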

(64)

Optimization Methods

(65)

Gradient Descent

Martens, J. (2010). Deep Learning via Hessian-Free Optimization. In Proceedings of the 27th International Conference on Machine Learning (pp. 735-742).

Too large a learning rate ⇒ zig-zag

Too small a learning rate ⇒ starvation

(66)

Batch Gradient Descent

Update based on the entire training data set

Susceptible to converging to local minima

Expensive and inefficient for large training data sets

(67)

Stochastic Gradient Descent (SGD)

Update based on a single example

More robust against local minima

Noisy updates ⇒ requires a small learning rate

(68)

Mini-Batch Gradient Descent

Update based on multiple examples

More robust against local minima

More stable than stochastic gradient descent

Most common

Often also called SGD even though the updates are based on multiple examples

(69)

Gradient Descent with Momentum

Momentum dampens oscillations

Gradient is computed before momentum is applied

Typical momentum term:
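
A common way to write this update (the symbols v, \gamma, \eta are notational assumptions; \gamma \approx 0.9 is a commonly used value rather than one taken from the slide):

v \leftarrow \gamma v - \eta \nabla_w E(w),
\qquad
w \leftarrow w + v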

(70)

Gradient Descent with Nesterov Momentum

Gradient is computed after momentum is applied

The anticipated update from the momentum term is applied before the gradient is computed, so the gradient includes knowledge of the momentum

Typically preferred over vanilla momentum
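
The corresponding Nesterov variant, in the same assumed notation, evaluates the gradient at the anticipated position:

v \leftarrow \gamma v - \eta \nabla_w E(w + \gamma v),
\qquad
w \leftarrow w + v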

(71)

AdaGrad

Adaptive (per-weight) learning rates

Learning rates of frequently occurring features are reduced while learning rates of infrequent features remain large

Monotonically decreasing learning rates

Suited for sparse data

Typical learning rate:
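
A sketch of the per-weight update (notation assumed; \eta \approx 0.01 is a commonly used default and \epsilon is a small constant for numerical stability):

g_t = \nabla_w E(w_t),
\qquad
G_t = G_{t-1} + g_t \odot g_t,
\qquad
w_{t+1} = w_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \odot g_t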

(72)

RMSProp

Typical hyperparameters:
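
A sketch of the update (notation assumed; \rho \approx 0.9 and \eta \approx 0.001 are commonly used defaults):

v_t = \rho v_{t-1} + (1 - \rho)\, g_t \odot g_t,
\qquad
w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \odot g_t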

(73)

Adam

Often used these days

Typical hyperparameters:
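
A sketch of the update (notation assumed; \beta_1 \approx 0.9, \beta_2 \approx 0.999, \epsilon \approx 10^{-8} and \eta \approx 0.001 are commonly used defaults):

m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t,
\qquad
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t \odot g_t,

\hat{m}_t = \frac{m_t}{1 - \beta_1^t},
\qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t},
\qquad
w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \odot \hat{m}_t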

(74)

Computation Graphs

(75)

Matrix-Vector Multiplication

[Diagram: a MATMUL OPERATION takes the symbolic variables W (SYMBOL TYPE: MATRIX, data type: float) and x (VECTOR, float) and outputs y (VECTOR, float).]

(76)

Indexing

[Diagram: an INDEXING operation takes A (MATRIX, float) and an index vector i (VECTOR, int), e.g. i = (2, 5, 0), and outputs B (MATRIX, float) containing the rows of A selected by i.]

(77)

Graph Optimization

[Diagram: a graph in which the float SCALAR y is computed from x and z via a MULTIPLY node followed by a DIVIDE node is simplified by graph OPTIMIZATION so that y is just x, in the spirit of (x · z) / z = x.]

(78)

Automatic Differentiation

[Diagram: y = SQUARE(x) for a float SCALAR x; GRAD(y, x) extends the graph with a MULTIPLY node and the constant 2, yielding dy/dx = 2 · x (SCALAR, float).]
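
A minimal sketch of the same idea using PyTorch’s autograd (PyTorch is an assumed choice of framework, not the one used in the lecture):

import torch

x = torch.tensor(3.0, requires_grad=True)  # scalar symbolic-style variable
y = x ** 2                                 # y = SQUARE(x), recorded in the graph
y.backward()                               # build and evaluate GRAD(y, x)
print(x.grad)                              # dy/dx = 2 * x -> tensor(6.)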

(79)

Neural Network Layers

[Diagram: a dense layer built from primitive operations, MATMUL(W, x) followed by ADD of the bias b giving the pre-activation a, followed by TANH giving the output z; the same computation can be wrapped into a single DENSE LAYER OP mapping the float VECTOR x to the float VECTOR z.]
