Gradient descent
Let J(w) be the cost function to be minimized.
The algorithm in broad strokes:
• guess a value for w
• compute the gradient of J at the point w
("direction of steepest ascent")
• take a step in the direction opposite to the gradient, and repeat
Illustration of gradient descent
(figure: cost function curve, with the initial weight marked)
Example: cost function used in linear regression
$$J(w) = \frac{1}{N}\sum_{n=1}^{N}\Big(y^{(n)} - \underbrace{h_w(x^{(n)})}_{\hat{y}^{(n)} = w^T x^{(n)}}\Big)^2$$
• what is denoted J here is called E_in in Mostafa's book
• sometimes the 1/N factor is absent; sometimes a 1/2 appears in its place
• remember that for this problem we have an analytical solution (which uses the pseudo-inverse), as sketched below
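For reference, a minimal NumPy sketch of that analytical solution (toy data; names and sizes are illustrative):

import numpy as np

# Toy data: N = 100 points, d = 2 features; prepend x_0 = 1 for the bias weight w_0
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=100)

# Analytical least-squares solution via the pseudo-inverse
w_star = np.linalg.pinv(X) @ y

# Cost J(w) = (1/N) sum_n (y - w^T x)^2, as defined above
print(w_star, np.mean((y - X @ w_star) ** 2))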
Gradient vector of J:
$$\nabla J(w) = \left[\frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}, \ldots, \frac{\partial J}{\partial w_d}\right]^T$$
$$\begin{aligned}
\frac{\partial J}{\partial w_j} &= \frac{\partial}{\partial w_j}\,\frac{1}{2}\sum_n \big(y^{(n)} - \hat{y}^{(n)}\big)^2\\
&= \frac{1}{2}\sum_n \frac{\partial}{\partial w_j}\big(y^{(n)} - \hat{y}^{(n)}\big)^2\\
&= \frac{1}{2}\sum_n 2\,\big(y^{(n)} - \hat{y}^{(n)}\big)\,\frac{\partial}{\partial w_j}\big(y^{(n)} - \hat{y}^{(n)}\big)\\
&= \sum_n \big(y^{(n)} - \hat{y}^{(n)}\big)\,\frac{\partial}{\partial w_j}\Big(y^{(n)} - \big(w_0 + w_1 x_1^{(n)} + \cdots + w_j x_j^{(n)} + \cdots + w_d x_d^{(n)}\big)\Big)\\
&= -\sum_n \big(y^{(n)} - \hat{y}^{(n)}\big)\, x_j^{(n)}
\end{aligned}$$
(component j of the gradient vector depends on the errors (y^{(n)} − ŷ^{(n)}) and on the j-th components of all points x^{(n)})
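A quick way to sanity-check this derivation is to compare the analytic gradient against finite differences; a small sketch, assuming the 1/2 convention used in the steps above:

import numpy as np

def J(w, X, y):
    # Cost with the 1/2 convention used in the derivation
    return 0.5 * np.sum((y - X @ w) ** 2)

def grad_J(w, X, y):
    # Analytic gradient: component j is -sum_n (y^(n) - yhat^(n)) x_j^(n)
    return -X.T @ (y - X @ w)

rng = np.random.default_rng(1)
X, y, w = rng.normal(size=(20, 3)), rng.normal(size=20), rng.normal(size=3)

# Central finite differences, one component at a time
eps = 1e-6
numeric = np.array([(J(w + eps * e, X, y) - J(w - eps * e, X, y)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(numeric, grad_J(w, X, y)))  # expected: True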
Gradient descent method
Gradient vector of J:
$$\nabla J(w) = \left[\frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}, \ldots, \frac{\partial J}{\partial w_d}\right]^T, \qquad \frac{\partial J}{\partial w_j} = -\sum_n \big(y^{(n)} - \hat{y}^{(n)}\big)\, x_j^{(n)}$$
Initial weight: w(0). Update rule (iteration r):
$$w(r+1) = w(r) + \eta\,\Delta w(r)$$
$$\Delta w(r) = -\nabla J(w(r)), \qquad \Delta w_j(r) = \sum_n \big(y^{(n)} - \hat{y}^{(n)}\big)\, x_j^{(n)}$$
Pseudocode

Algorithm 1 GradientDescent
Input: D, η, iter
Output: w
  w ← small random value
  repeat
    Δw_j ← 0, j = 0, 1, 2, . . . , d
    for all (x, y) in D do
      compute ŷ = w^T x
      Δw_j ← Δw_j + (y − ŷ) x_j, j = 0, 1, 2, . . . , d
    end for
    w_j ← w_j + η Δw_j, j = 0, 1, 2, . . . , d
  until number of iterations = iter
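A direct, runnable translation of Algorithm 1 into Python (vectorized over j; eta and iters stand for the slide's η and iter, and the 0.01 scale of the initial weights is an illustrative choice):

import numpy as np

def gradient_descent(X, y, eta, iters):
    # X has shape (N, d+1) with x_0 = 1; returns the learned weight vector w
    rng = np.random.default_rng(0)
    w = 0.01 * rng.normal(size=X.shape[1])   # w <- small random value
    for _ in range(iters):                   # repeat ... until iterations = iter
        y_hat = X @ w                        # compute yhat = w^T x for all points
        delta_w = X.T @ (y - y_hat)          # delta_w_j = sum_n (y - yhat) x_j
        w = w + eta * delta_w                # w_j <- w_j + eta * delta_w_j
    return w

With the toy data from the pseudo-inverse sketch, gradient_descent(X, y, eta=0.001, iters=5000) should approach the same w_star.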
Pseudocode

Algorithm 2 StochasticGradientDescent
Input: D, η, iter
Output: w
  w ← small random value
  repeat
    for all (x, y) in D do
      compute ŷ = w^T x
      w_j ← w_j + η (y − ŷ) x_j, j = 0, 1, 2, . . . , d
    end for
  until number of iterations = iter
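The corresponding sketch for Algorithm 2, updating w once per example (the per-pass shuffling is a common addition, not part of the slide):

import numpy as np

def stochastic_gradient_descent(X, y, eta, iters):
    # One weight update per training example, instead of one per full pass
    rng = np.random.default_rng(0)
    w = 0.01 * rng.normal(size=X.shape[1])   # w <- small random value
    for _ in range(iters):                   # repeat ... until iterations = iter
        for n in rng.permutation(len(y)):    # for all (x, y) in D, random order
            y_hat = X[n] @ w                 # compute yhat = w^T x
            w = w + eta * (y[n] - y_hat) * X[n]  # w_j <- w_j + eta (y - yhat) x_j
    return w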
Fixed-size step?
How η affects the algorithm:
(figures: in-sample error E_in vs. weights w, under a large η and a small η)
Panel captions: η too small; η too large; variable η just right
η should increase with the slope
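One way to read that remark: with the raw-gradient update Δw = −η∇J the step length already grows with the slope, whereas a literally fixed-size step does not. A toy 1-D illustration on J(w) = w² (entirely made up for this comparison):

import numpy as np

grad = lambda w: 2.0 * w        # gradient of J(w) = w^2
w_scaled, w_fixed, eta = 5.0, 5.0, 0.1
for _ in range(60):
    w_scaled -= eta * grad(w_scaled)          # step shrinks as the slope flattens
    w_fixed  -= eta * np.sign(grad(w_fixed))  # fixed-length step of size eta
print(w_scaled, w_fixed)  # w_scaled -> ~0; w_fixed ends up bouncing around 0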
Observation: the example shown previously is an application of gradient descent to the MSE (mean squared error) cost function. The algorithm can be applied to minimize other cost functions (and, obviously, the gradient vector should be computed accordingly).
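To make the observation concrete, the same loop works for any differentiable cost once its gradient is supplied; a sketch with a hand-picked non-MSE cost (function and names invented for illustration):

import numpy as np

def minimize(grad, w0, eta=0.1, iters=200):
    # Generic gradient descent: only the gradient of the cost changes
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        w = w - eta * grad(w)
    return w

# Example cost J(w) = (w_0 - 1)^2 + 4 (w_1 + 2)^2, gradient derived by hand
grad_J = lambda w: np.array([2.0 * (w[0] - 1.0), 8.0 * (w[1] + 2.0)])
print(minimize(grad_J, [0.0, 0.0]))  # approaches the minimizer [1, -2]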
A bit more on
Gradient Descent × Stochastic Gradient Descent
GD minimizes
$$E_{in}(w) = \frac{1}{N}\sum_{n=1}^{N} \underbrace{e\big(h(x_n), y_n\big)}_{\ln(1 + e^{-y_n w^T x_n})\ \leftarrow\ \text{in logistic regression}}$$
by iterative steps along −∇E_in:
$$\Delta w = -\eta\,\nabla E_{in}(w)$$
∇E_in is based on all examples (x_n, y_n): "batch" GD
Pick one (x_n, y_n) at a time. Apply GD to e(h(x_n), y_n). Average direction:
$$\mathbb{E}_n\big[-\nabla e\big(h(x_n), y_n\big)\big] = \frac{1}{N}\sum_{n=1}^{N} -\nabla e\big(h(x_n), y_n\big) = -\nabla E_{in}$$
a randomized version of GD: stochastic gradient descent (SGD)
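Plugging the logistic loss above into the SGD template: for a single example, ∇e(h(x_n), y_n) = −y_n x_n / (1 + e^{y_n w^T x_n}). A minimal sketch, assuming labels y_n ∈ {−1, +1}:

import numpy as np

def sgd_logistic(X, y, eta=0.1, iters=100):
    # SGD on e(h(x_n), y_n) = ln(1 + exp(-y_n w^T x_n)), one example at a time
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        for n in rng.permutation(len(y)):    # pick one (x_n, y_n) at a time
            g = -y[n] * X[n] / (1.0 + np.exp(y[n] * (X[n] @ w)))  # grad of e at w
            w = w - eta * g                  # GD step on this single example
    return w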
Benefits of SGD
(figure: E_in vs. weights w; caption: "randomization helps")
1. cheaper computation
2. randomization
3. simple
Rule of thumb: η = 0.1 works
Gradient descent
https://imaddabbura.github.io/post/gradient_descent_algorithms/