Loss function, gradient descent and back-propagation

Neural networks contain many parameters and compositions of mappings, which allow them to fit highly nonlinear data surfaces. Gradient descent and back-propagation are often used to train these parameters. Gradient descent is the method used to optimize the weights of a composition of mappings, while back-propagation is the step-wise process used in feed-forward neural networks to pass errors backwards (see [22]). To explain gradient descent, consider an arbitrary finite composition of mappings

$$x \mapsto \hat{y}(x) := (f_K \circ f_{K-1} \circ \cdots \circ f_1)(x) = f_K(f_{K-1}(\dots(f_1(x)))) \tag{5.8}$$

where $x \in \mathbb{R}^m$ denotes an input vector and $(f_i)_{i \le K}$ denotes a family of mappings $f_i : \mathbb{R}^{m_{i-1}} \to \mathbb{R}^{m_i}$. The result $\hat{y}(x)$ is the estimate of our target variable (henceforth referred to as the estimate or output). The target variable can either be an observed label or a regression value (as in supervised learning), or a target value formulated through an optimization criterion (as in unsupervised learning). An example of the latter is $k$-means, used in Chapter 6.
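As a minimal sketch of the composition in equation (5.8), the following Python snippet builds $\hat{y}$ from a list of mappings. The helper `compose` and the three mappings `f1`, `f2`, `f3` are hypothetical and chosen only for illustration; they are not taken from the text.

```python
from functools import reduce

def compose(*fs):
    """Return the composition f_K o ... o f_1 for fs = (f_1, ..., f_K).

    f_1 is applied first and f_K last, as in equation (5.8).
    """
    return lambda x: reduce(lambda acc, f: f(acc), fs, x)

# Hypothetical mappings (assumptions for illustration only).
f1 = lambda x: [2.0 * xi for xi in x]   # R^2 -> R^2, scales each coordinate
f2 = lambda x: [x[0] + x[1]]            # R^2 -> R^1, sums the coordinates
f3 = lambda x: [max(0.0, x[0])]         # R^1 -> R^1, ReLU

y_hat = compose(f1, f2, f3)             # y_hat(x) = f3(f2(f1(x)))
print(y_hat([1.0, 2.0]))
print(y_hat([1.0, -3.0]))
```

The output dimension of each mapping matches the input dimension of the next, which is the only structural requirement on the family $(f_i)_{i \le K}$.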

5.4.1 Loss and cost function

A loss function quantifies the error between the estimate and the target variable, i.e.

$$L(y, \hat{y}),$$

where $\hat{y}$ is the output of the mapping and $y$ is the true output. A good loss function quantifies different types of errors in a desirable way, given the application.

Thus there is no universally good choice of loss function. Common choices are squared error, $L(y, \hat{y}) = \frac{1}{2} \sum_i (y_i - \hat{y}_i)^2$, for regression (fitting a numerical value) and cross-entropy, $L(y, \hat{y}) = -\sum_i y_i \log(\hat{y}_i)$, for classification/prediction. The choice depends on domain, target output and fit choices (for example heavily penalizing large values). Given a loss function $L$, the cost function is given as

$$C = \frac{1}{N} \sum_{j=1}^{N} L(y_j, \hat{y}_j),$$

where $N$ denotes the number of samples, and hence the cost function is simply the average of the loss over the samples. Caution is advised, since the names “loss function” and “cost function” are often used synonymously in the literature. We choose to distinguish between the two, as we later optimize the weights based on the cost function. The phrase “optimizing the loss function” is ambiguous, as it may refer either to changing the loss function to a different choice or to optimizing it with gradient descent.
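The distinction between loss and cost can be illustrated with a short Python sketch. The function names and the toy targets and estimates below are assumptions made for this example; the cross-entropy skips terms with $y_i = 0$ so that $\log(0)$ is never evaluated.

```python
import math

def squared_error(y, y_hat):
    """Loss for one sample: 0.5 * sum_i (y_i - y_hat_i)^2."""
    return 0.5 * sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))

def cross_entropy(y, y_hat):
    """Loss for one sample: -sum_i y_i * log(y_hat_i).

    Terms with y_i == 0 contribute nothing and are skipped to avoid log(0).
    """
    return -sum(yi * math.log(yhi) for yi, yhi in zip(y, y_hat) if yi != 0)

def cost(loss, ys, y_hats):
    """Cost: the average of the loss over all N samples."""
    return sum(loss(y, y_hat) for y, y_hat in zip(ys, y_hats)) / len(ys)

# Toy data (hypothetical): two samples with one-hot targets.
ys = [[1.0, 0.0], [0.0, 1.0]]
y_hats = [[0.5, 0.5], [0.0, 1.0]]

print(cost(squared_error, ys, y_hats))
print(cost(cross_entropy, ys, y_hats))
```

Here `squared_error` and `cross_entropy` play the role of $L$, while `cost` plays the role of $C$ and is the quantity that is later minimized.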

5.4.2 Gradient descent

In this section, we use gradient descent to adjust the weights in the composition of mappings in equation (5.8) with the objective of minimizing the cost function. We start with the cost function and view it as a function of an arbitrary parameter $z$. The linearity of differentiation implies that

$$\frac{\partial C}{\partial z} = \frac{1}{N} \sum_{j=1}^{N} \frac{\partial L(y_j, \hat{y}_j)}{\partial z}.$$

Thus, to decrease the notational load, we omit the sum and just study derivatives of the mapping

$$x \mapsto L(y, \hat{y}(x)) = L(y, (f_K \circ f_{K-1} \circ \cdots \circ f_1)(x)).$$

Define $z^{[0]} := x$ and $z^{[k]} := f_k(z^{[k-1]})$ for $1 \le k \le K$. Suppose $f_j(x) = a(Wx)$ for some matrix $W = \{w_{ij}\}$ of suitable dimension and activation function $a$ (but it could be any mapping which introduces some parameter $z$ for which it is meaningful to differentiate the loss function). Provided that $1 \le j < K$, the chain rule implies that for a fixed $w_{ij}$,

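$$\frac{\partial L(y_j, \hat{y}_j(w_{ij}))}{\partial w_{ij}} = \frac{\partial L}{\partial a^{[K]}} \frac{\partial a^{[K]}}{\partial a^{[K-1]}} \cdots \frac{\partial a^{[j]}}{\partial w_{ij}},$$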

where the fractions denote Jacobian matrices. The structure of feed-forward neural networks allows us to compute each of these Jacobian matrices in a step-wise procedure called back-propagation, which we elaborate on in the next subsection. Let $w^{(0)}_{ij}$ denote the initial value of $w_{ij}$. The parameter update at iteration $t$ of gradient descent is given by

$$w_{ij}^{(t)} = w_{ij}^{(t-1)} - \alpha \frac{\partial L(y_j, \hat{y}_j)}{\partial w_{ij}},$$

where $\alpha$ denotes the learning rate (a hyperparameter, see Section 5.5.4). The process of “back-propagating” errors (or derivatives) stepwise towards the input is called back-propagation (see [22]) and is the general method used to optimize feed-forward neural networks with multiple layers. Certain optimization algorithms may modify how this optimization is performed.
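To make the update rule concrete, the following Python sketch runs gradient descent on a deliberately simple model. The model $\hat{y} = wx$ with a single scalar weight, the toy data, the learning rate and the number of iterations are all assumptions made for illustration. Unlike the per-sample formula above, the sketch uses the gradient of the cost, i.e. the average of the per-sample derivatives.

```python
def grad_cost(w, xs, ys):
    """dC/dw for the model y_hat = w * x with squared error loss.

    Per sample, L = 0.5 * (y - w*x)^2, so dL/dw = -(y - w*x) * x.
    The cost gradient is the average of these per-sample derivatives.
    """
    return sum(-(y - w * x) * x for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(w0, xs, ys, alpha=0.1, steps=200):
    """Iterate w^(t) = w^(t-1) - alpha * dC/dw, starting from w^(0) = w0."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad_cost(w, xs, ys)
    return w

# Toy data (hypothetical), generated by y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = gradient_descent(0.0, xs, ys)
print(w)  # close to 2.0 for this data and learning rate
```

For this quadratic cost the iteration contracts towards the minimizer whenever $\alpha$ is small enough; a learning rate that is too large would make the iterates diverge, which is one reason $\alpha$ is treated as a hyperparameter.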

5.4.3 Back-propagation

In this section, we describe back-propagation as a stepwise process of updating the weights in a feed-forward neural network. We lean heavily on the excellent explanation provided in Chapter 5 of the book [1]. Consider the composition of mappings (or layers) from Subsection 5.3.1, i.e.

$$z^{[j]} := a^{[j]}(W^{[j]} z^{[j-1]}), \quad \text{for } 1 \le j \le K,$$

where $a^{[j]}$ is an activation function, $W^{[j]}$ is a matrix of suitable dimension and $z^{[j]}$ is a vector with $z^{[0]} = x$, where $x$ is the input vector of the feed-forward neural network, and finally $z^{[K]}$ is the output of the feed-forward neural network (i.e. the result of the last mapping). Define the notation ($h$ should not be interpreted as hidden unit; we merely needed more notation to simplify calculations later)

$$h^{[j]} = W^{[j]} z^{[j-1]}, \quad 1 \le j \le K.$$

To measure the error, we use a loss function $L(y, z^{[K]})$ and the first goal is to compute the derivative

$$\frac{\partial L(y, z^{[K]})}{\partial w_{ij}^{[K]}},$$

where $w^{[K]}_{ij}$ is the $(i, j)$'th entry of the matrix $W^{[K]}$. Since $w^{[K]}_{ij}$ enters only in the $i$th coordinate of $z^{[K]}$ by equation (5.7), combining this with the chain rule lets us write the derivative as

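$$\frac{\partial L(y, z^{[K]})}{\partial w_{ij}^{[K]}} = \frac{\partial L(y, z^{[K]})}{\partial z^{[K]}_i} \frac{\partial a^{[K]}}{\partial h^{[K]}_i} \frac{\partial h^{[K]}_i}{\partial w^{[K]}_{ij}}.$$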

Define for each $i$ the notation $\delta^{[K]}_i$ by

$$\delta^{[K]}_i := \frac{\partial L(y, z^{[K]})}{\partial z^{[K]}_i} \frac{\partial a^{[K]}}{\partial h^{[K]}_i},$$

and refer to $\delta^{[K]}_i$ as the errors in the $K$'th layer. Observe that

$$\frac{\partial h^{[K]}_i}{\partial w^{[K]}_{ij}} = z^{[K-1]}_j,$$

and thus the overall derivative becomes

$$\frac{\partial L(y, z^{[K]})}{\partial w^{[K]}_{ij}} = \delta_i^{[K]} z^{[K-1]}_j.$$

Similarly, define $\delta_j^{[K-1]}$ for the layer $[K-1]$ by

$$\delta^{[K-1]}_j := \frac{\partial L}{\partial h^{[K-1]}_j}.$$

We may rewrite this derivative using the chain rule

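$$\delta^{[K-1]}_j = \left(\frac{\partial L}{\partial h^{[K]}}\right)^{T} \frac{\partial h^{[K]}}{\partial h^{[K-1]}_j} = \sum_k \frac{\partial L}{\partial h^{[K]}_k} \frac{\partial h^{[K]}_k}{\partial h^{[K-1]}_j},$$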

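from which we may simplify the notation to

$$\delta^{[K-1]}_j = \sum_k \delta_k^{[K]} w^{[K]}_{kj} \frac{\partial a^{[K-1]}}{\partial h^{[K-1]}_j} = \frac{\partial a^{[K-1]}}{\partial h^{[K-1]}_j} \sum_k \delta^{[K]}_k w^{[K]}_{kj}.$$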

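Iterating this argument, the same relation expresses $\delta^{[s]}_j$ through $\delta^{[s+1]}$ for every $1 \le s < K$, starting from $\delta^{[K]}$. The procedure of calculating the $\delta$'s through the previously calculated $\delta$'s is called back-propagation.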

The take-away is that we may compute $\delta_j^{[s]}$ through the previously calculated $\delta$'s (by iterating this procedure backwards) and obtain the desired derivatives for gradient descent through the formula

$$\frac{\partial L}{\partial w^{[s]}_{ij}} = \frac{\partial L}{\partial h^{[s]}_i} \frac{\partial h^{[s]}_i}{\partial w^{[s]}_{ij}} = \delta^{[s]}_i z_j^{[s-1]},$$

where $w^{[s]}_{ij}$ denotes the $(i, j)$'th entry of $W^{[s]}$ in the $s$th layer.
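The backward recursion can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than a reference implementation: every layer uses $\tanh$ as activation (so that $\partial a / \partial h = 1 - z^2$), the loss is the squared error, and the weights, input and target are arbitrary toy values. The function names are hypothetical. A central finite-difference check is included to compare against the formula $\delta^{[s]}_i z^{[s-1]}_j$.

```python
import math

def forward(Ws, x):
    """Return [z^[0], ..., z^[K]] with z^[j] = tanh(W^[j] z^[j-1])."""
    zs = [x]
    for W in Ws:
        h = [sum(w * z for w, z in zip(row, zs[-1])) for row in W]  # h^[j]
        zs.append([math.tanh(hi) for hi in h])                      # z^[j]
    return zs

def loss(y, z):
    """Squared error between the target y and the network output z."""
    return 0.5 * sum((yi - zi) ** 2 for yi, zi in zip(y, z))

def backprop(Ws, x, y):
    """Return dL/dW^[s] for every layer, with the same shapes as Ws."""
    zs = forward(Ws, x)
    K = len(Ws)
    # Output layer: delta_i = dL/dz_i * a'(h_i), and a'(h) = 1 - z^2 for tanh.
    delta = [(zi - yi) * (1.0 - zi ** 2) for zi, yi in zip(zs[K], y)]
    grads = [None] * K
    for s in range(K, 0, -1):                       # layers K, K-1, ..., 1
        z_prev = zs[s - 1]                          # z^[s-1]
        # dL/dw_ij^[s] = delta_i^[s] * z_j^[s-1]
        grads[s - 1] = [[d * zj for zj in z_prev] for d in delta]
        if s > 1:
            W = Ws[s - 1]                           # W^[s]
            # delta_j^[s-1] = a'(h_j^[s-1]) * sum_k delta_k^[s] * w_kj^[s]
            new_delta = []
            for j in range(len(z_prev)):
                back = sum(delta[k] * W[k][j] for k in range(len(delta)))
                new_delta.append((1.0 - z_prev[j] ** 2) * back)
            delta = new_delta
    return grads

def numeric_grad(Ws, x, y, s, i, j, eps=1e-6):
    """Central finite-difference estimate of dL/dw_ij in layer index s."""
    Wp = [[row[:] for row in W] for W in Ws]
    Wm = [[row[:] for row in W] for W in Ws]
    Wp[s][i][j] += eps
    Wm[s][i][j] -= eps
    return (loss(y, forward(Wp, x)[-1]) - loss(y, forward(Wm, x)[-1])) / (2 * eps)

# Toy network (hypothetical values): 2 inputs -> 3 units -> 1 output.
Ws = [[[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]],   # W^[1], shape 3 x 2
      [[0.3, -0.1, 0.2]]]                       # W^[2], shape 1 x 3
x, y = [1.0, 0.5], [0.7]

grads = backprop(Ws, x, y)
max_err = max(abs(grads[s][i][j] - numeric_grad(Ws, x, y, s, i, j))
              for s in range(len(Ws))
              for i in range(len(Ws[s]))
              for j in range(len(Ws[s][i])))
print(max_err)  # should be small if the recursion is implemented correctly
```

Note that the forward pass stores every $z^{[j]}$, since both the weight derivatives and the factor $\partial a / \partial h$ in the recursion reuse them; this is what makes the backward sweep cost roughly the same as one forward pass.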
