Gradient descent
Let J(w) be the cost function to be minimized.
The algorithm in broad strokes:
• guess a value for w
• compute the gradient of J at the point w
("direction of steepest ascent")
• take a step in the direction opposite to the gradient, and repeat
Illustration of gradient descent
(figure: cost function curve, with the initial weight marked)
Example: cost function used in linear regression
$$J(w) = \frac{1}{N}\sum_{n=1}^{N}\Big(y^{(n)} - \underbrace{h_w(x^{(n)})}_{\hat{y}^{(n)} = w^T x^{(n)}}\Big)^2$$
• what is denoted J here is called E_in in Mostafa's book
• sometimes the 1/N factor is absent; sometimes a 1/2 appears in its place
• remember that for this problem we have an analytical solution (which uses the pseudo-inverse), as sketched below
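For reference, a minimal NumPy sketch of that analytical solution (toy data; names and sizes are illustrative):

import numpy as np

# Toy data: N = 100 points, d = 2 features; prepend x_0 = 1 for the bias weight w_0
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=100)

# Analytical least-squares solution via the pseudo-inverse
w_star = np.linalg.pinv(X) @ y

# Cost J(w) = (1/N) sum_n (y - w^T x)^2, as defined above
print(w_star, np.mean((y - X @ w_star) ** 2))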
Gradient vector of J:
$$\nabla J(w) = \left[\frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}, \ldots, \frac{\partial J}{\partial w_d}\right]^T$$
$$\begin{aligned}
\frac{\partial J}{\partial w_j} &= \frac{\partial}{\partial w_j}\,\frac{1}{2}\sum_n \big(y^{(n)} - \hat{y}^{(n)}\big)^2\\
&= \frac{1}{2}\sum_n \frac{\partial}{\partial w_j}\big(y^{(n)} - \hat{y}^{(n)}\big)^2\\
&= \frac{1}{2}\sum_n 2\,\big(y^{(n)} - \hat{y}^{(n)}\big)\,\frac{\partial}{\partial w_j}\big(y^{(n)} - \hat{y}^{(n)}\big)\\
&= \sum_n \big(y^{(n)} - \hat{y}^{(n)}\big)\,\frac{\partial}{\partial w_j}\Big(y^{(n)} - \big(w_0 + w_1 x_1^{(n)} + \cdots + w_j x_j^{(n)} + \cdots + w_d x_d^{(n)}\big)\Big)\\
&= -\sum_n \big(y^{(n)} - \hat{y}^{(n)}\big)\, x_j^{(n)}
\end{aligned}$$
(component j of the gradient vector depends on the errors (y^{(n)} − ŷ^{(n)}) and on the j-th components of all points x^{(n)})
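A quick way to sanity-check this derivation is to compare the analytic gradient against finite differences; a small sketch, assuming the 1/2 convention used in the steps above:

import numpy as np

def J(w, X, y):
    # Cost with the 1/2 convention used in the derivation
    return 0.5 * np.sum((y - X @ w) ** 2)

def grad_J(w, X, y):
    # Analytic gradient: component j is -sum_n (y^(n) - yhat^(n)) x_j^(n)
    return -X.T @ (y - X @ w)

rng = np.random.default_rng(1)
X, y, w = rng.normal(size=(20, 3)), rng.normal(size=20), rng.normal(size=3)

# Central finite differences, one component at a time
eps = 1e-6
numeric = np.array([(J(w + eps * e, X, y) - J(w - eps * e, X, y)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(numeric, grad_J(w, X, y)))  # expected: True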
Gradient descent method
Gradient vector of J:
$$\nabla J(w) = \left[\frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}, \ldots, \frac{\partial J}{\partial w_d}\right]^T, \qquad \frac{\partial J}{\partial w_j} = -\sum_n \big(y^{(n)} - \hat{y}^{(n)}\big)\, x_j^{(n)}$$
Initial weight: w(0). Update rule (iteration r):
$$w(r+1) = w(r) + \eta\,\Delta w(r)$$
$$\Delta w(r) = -\nabla J(w(r)), \qquad \Delta w_j(r) = \sum_n \big(y^{(n)} - \hat{y}^{(n)}\big)\, x_j^{(n)}$$
Pseudocode

Algorithm 1 GradientDescent
Input: D, η, iter
Output: w
  w ← small random value
  repeat
    Δw_j ← 0, j = 0, 1, 2, . . . , d
    for all (x, y) in D do
      compute ŷ = w^T x
      Δw_j ← Δw_j + (y − ŷ) x_j, j = 0, 1, 2, . . . , d
    end for
    w_j ← w_j + η Δw_j, j = 0, 1, 2, . . . , d
  until number of iterations = iter
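A direct, runnable translation of Algorithm 1 into Python (vectorized over j; eta and iters stand for the slide's η and iter, and the 0.01 scale of the initial weights is an illustrative choice):

import numpy as np

def gradient_descent(X, y, eta, iters):
    # X has shape (N, d+1) with x_0 = 1; returns the learned weight vector w
    rng = np.random.default_rng(0)
    w = 0.01 * rng.normal(size=X.shape[1])   # w <- small random value
    for _ in range(iters):                   # repeat ... until iterations = iter
        y_hat = X @ w                        # compute yhat = w^T x for all points
        delta_w = X.T @ (y - y_hat)          # delta_w_j = sum_n (y - yhat) x_j
        w = w + eta * delta_w                # w_j <- w_j + eta * delta_w_j
    return w

With the toy data from the pseudo-inverse sketch, gradient_descent(X, y, eta=0.001, iters=5000) should approach the same w_star.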
Pseudocode

Algorithm 2 StochasticGradientDescent
Input: D, η, iter
Output: w
  w ← small random value
  repeat
    for all (x, y) in D do
      compute ŷ = w^T x
      w_j ← w_j + η (y − ŷ) x_j, j = 0, 1, 2, . . . , d
    end for
  until number of iterations = iter
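The corresponding sketch for Algorithm 2, updating w once per example (the per-pass shuffling is a common addition, not part of the slide):

import numpy as np

def stochastic_gradient_descent(X, y, eta, iters):
    # One weight update per training example, instead of one per full pass
    rng = np.random.default_rng(0)
    w = 0.01 * rng.normal(size=X.shape[1])   # w <- small random value
    for _ in range(iters):                   # repeat ... until iterations = iter
        for n in rng.permutation(len(y)):    # for all (x, y) in D, random order
            y_hat = X[n] @ w                 # compute yhat = w^T x
            w = w + eta * (y[n] - y_hat) * X[n]  # w_j <- w_j + eta (y - yhat) x_j
    return w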
Fixed-size step?
How η affects the algorithm:
(figures: in-sample error E_in vs. weights w, under a large η and a small η)
Panel captions: η too small; η too large; variable η just right
η should increase with the slope
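One way to read that remark: with the raw-gradient update Δw = −η∇J the step length already grows with the slope, whereas a literally fixed-size step does not. A toy 1-D illustration on J(w) = w² (entirely made up for this comparison):

import numpy as np

grad = lambda w: 2.0 * w        # gradient of J(w) = w^2
w_scaled, w_fixed, eta = 5.0, 5.0, 0.1
for _ in range(60):
    w_scaled -= eta * grad(w_scaled)          # step shrinks as the slope flattens
    w_fixed  -= eta * np.sign(grad(w_fixed))  # fixed-length step of size eta
print(w_scaled, w_fixed)  # w_scaled -> ~0; w_fixed ends up bouncing around 0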
Observation: the example shown previously is an application of gradient descent to the MSE (mean squared error) cost function. The algorithm can be applied to minimize other cost functions (and, obviously, the gradient vector should be computed accordingly).
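To make the observation concrete, the same loop works for any differentiable cost once its gradient is supplied; a sketch with a hand-picked non-MSE cost (function and names invented for illustration):

import numpy as np

def minimize(grad, w0, eta=0.1, iters=200):
    # Generic gradient descent: only the gradient of the cost changes
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        w = w - eta * grad(w)
    return w

# Example cost J(w) = (w_0 - 1)^2 + 4 (w_1 + 2)^2, gradient derived by hand
grad_J = lambda w: np.array([2.0 * (w[0] - 1.0), 8.0 * (w[1] + 2.0)])
print(minimize(grad_J, [0.0, 0.0]))  # approaches the minimizer [1, -2]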
A bit more on
Gradient Descent × Stochastic Gradient Descent
GD minimizes
$$E_{in}(w) = \frac{1}{N}\sum_{n=1}^{N} \underbrace{e\big(h(x_n), y_n\big)}_{\ln(1 + e^{-y_n w^T x_n})\ \leftarrow\ \text{in logistic regression}}$$
by iterative steps along −∇E_in:
$$\Delta w = -\eta\,\nabla E_{in}(w)$$
∇E_in is based on all examples (x_n, y_n): "batch" GD
Pick one (x_n, y_n) at a time. Apply GD to e(h(x_n), y_n). Average direction:
$$\mathbb{E}_n\big[-\nabla e\big(h(x_n), y_n\big)\big] = \frac{1}{N}\sum_{n=1}^{N} -\nabla e\big(h(x_n), y_n\big) = -\nabla E_{in}$$
a randomized version of GD: stochastic gradient descent (SGD)
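Plugging the logistic loss above into the SGD template: for a single example, ∇e(h(x_n), y_n) = −y_n x_n / (1 + e^{y_n w^T x_n}). A minimal sketch, assuming labels y_n ∈ {−1, +1}:

import numpy as np

def sgd_logistic(X, y, eta=0.1, iters=100):
    # SGD on e(h(x_n), y_n) = ln(1 + exp(-y_n w^T x_n)), one example at a time
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        for n in rng.permutation(len(y)):    # pick one (x_n, y_n) at a time
            g = -y[n] * X[n] / (1.0 + np.exp(y[n] * (X[n] @ w)))  # grad of e at w
            w = w - eta * g                  # GD step on this single example
    return w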
Benefits of SGD
(figure: E_in vs. weights w; caption: "randomization helps")
1. cheaper computation
2. randomization
3. simple
Rule of thumb: η = 0.1 works
Gradient descent
https://imaddabbura.github.io/post/gradient_descent_algorithms/