
OMD Connection to Proximal Operators

and let $g_t \in \partial f_t(x_t)$ be as in the definition of $\mathrm{EOMD}_X^R(f)$ for each $t \in [T]$. Then $x_1 = d^{-1}\mathbf{1}$ and
\[
x_{t+1}(i) = \frac{1}{\omega_{t+1}}\, x_t(i)\, e^{-\eta g_t(i)}, \quad\text{where}\quad \omega_{t+1} := \sum_{j=1}^{d} x_t(j)\, e^{-\eta g_t(j)}, \qquad \forall i \in [d],\ \forall t \in \{1, \dots, T-1\}.
\]

Proof. By Lemma 5.2.3, we know that $R$ is a mirror map for $\Delta_d$. Thus, it only remains to prove the form of the points $x_t$ for $t \in [T]$. First of all, by Proposition 3.11.5 we have
\[
\{d^{-1}\mathbf{1}\} = \operatorname*{arg\,min}_{x \in \Delta_d} B_R(x, \mathbf{1}) = \operatorname*{arg\,min}_{x \in \Delta_d} \bigl(R(x) + \|x\|_1 - \|\mathbf{1}\|_1\bigr) = \operatorname*{arg\,min}_{x \in \Delta_d} \bigl(R(x) + 1 - d\bigr) = \operatorname*{arg\,min}_{x \in \Delta_d} R(x),
\]
and since $x_1 \in \operatorname*{arg\,min}_{x \in \Delta_d} R(x)$ by definition, we conclude that $x_1 = d^{-1}\mathbf{1}$. Let $y_t \in \mathbb{R}^d$ be as in the definition of $\mathrm{EOMD}_X^R(f)$ for each $t \in [T]$. For every $t \in \{1, \dots, T-1\}$ we have, by the definition of the algorithm, that

\[
y_{t+1}(i) = \nabla R(x_t)(i) - g_t(i) = \frac{1}{\eta}\bigl(1 + \ln x_t(i)\bigr) - g_t(i), \qquad \forall i \in [d].
\]
By Proposition 3.4.4 we have $R^*(z) = \frac{1}{\eta} \sum_{i=1}^{d} e^{\eta z(i) - 1}$ for every $z \in \mathbb{R}^d$, which implies
\[
\nabla R^*(z)(i) = e^{\eta z(i) - 1}, \qquad \forall i \in [d],\ \forall z \in \mathbb{R}^d.
\]
Therefore, for every $t \in \{1, \dots, T-1\}$,
\[
\nabla R^*(y_{t+1})(i) = \exp\bigl(\eta y_{t+1}(i) - 1\bigr) = \exp\bigl(1 + \ln x_t(i) - \eta g_t(i) - 1\bigr) = x_t(i)\, e^{-\eta g_t(i)}, \qquad \forall i \in [d].
\]
Since $x_{t+1} = \Pi^R_{\Delta_d}(\nabla R^*(y_{t+1}))$, and since by Proposition 3.11.5 the Bregman projection onto $\Delta_d$ w.r.t.\ $R$ boils down to a normalization w.r.t.\ the $\ell_1$-norm, we are done.
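The closed-form update just derived is simple to state in code. Below is a minimal Python sketch of one exponentiated-gradient (entropic OMD) step on the simplex; the step size `eta` and the subgradient `g1` are illustrative values, not taken from the text:

```python
import math

def eg_step(x, g, eta):
    """One exponentiated-gradient (entropic OMD) step on the simplex.

    Implements the closed form from the proposition:
    x_{t+1}(i) = x_t(i) * exp(-eta * g_t(i)) / omega_{t+1},
    where omega_{t+1} = sum_j x_t(j) * exp(-eta * g_t(j)).
    """
    w = [xi * math.exp(-eta * gi) for xi, gi in zip(x, g)]
    omega = sum(w)
    return [wi / omega for wi in w]

d = 4
x1 = [1.0 / d] * d            # x_1 = d^{-1} * 1: the uniform distribution
g1 = [0.3, -0.1, 0.5, 0.0]    # illustrative subgradient of the round's loss
x2 = eg_step(x1, g1, eta=0.5)
# x2 stays on the simplex, with mass shifted toward low-loss coordinates
```

Note that the iterate never leaves the relative interior of the simplex: the multiplicative update keeps every coordinate strictly positive, and the normalization restores unit $\ell_1$-norm.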

The update rules derived from the mirror maps in the above propositions are simple to implement and well studied in the optimization literature. Thus, one may hope to unify the convergence proofs of many existing algorithms with a general convergence proof of Online Mirror Descent. Indeed, Adaptive OMD happens to have many interesting connections with proximal operators and with Adaptive FTRL. The connection with the former allows us to write the general update rule of AdaOMD as a closed formula which sheds light on its inner workings. The connection with Adaptive FTRL gives yet another perspective on how AdaOMD works, and yields regret bounds through the use of the Strong FTRL Lemma.

It follows from Lemma 3.9.14 that the above set is indeed a singleton. The name proximal comes from the fact that if we use $f := \delta(\cdot \mid X)$ for some closed set $X \subseteq \mathbb{E}$, the point given by the proximal operator for $\bar{x} \in \mathbb{E}$ is exactly (one of) the closest point(s) to $\bar{x}$ in $X$ (w.r.t.\ the $\ell_2$-norm).
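As a concrete instance of this fact, for the indicator of a box $[\ell, u]^d$ the proximal operator is just coordinatewise clipping. A small sketch (the box bounds and input vector are illustrative choices):

```python
def prox_box_indicator(xbar, lo, hi):
    """prox of f = delta(. | [lo, hi]^d): the Euclidean projection onto the box.

    Minimizing delta(x | X) + 0.5 * ||x - xbar||_2^2 forces x into X and then
    picks the closest such point, which for a box is clipping per coordinate.
    """
    return [min(max(v, lo), hi) for v in xbar]

p = prox_box_indicator([2.0, -3.0, 0.5], 0.0, 1.0)
# p == [1.0, 0.0, 0.5]: each coordinate is the closest point in [0, 1]
```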

If we look more carefully at (5.5), we can see a clear intuition for the inner workings of $\operatorname{prox}_f(\bar{x})$: it tries to find a point that approximately minimizes $f$ without going too far away from $\bar{x}$. Interestingly, if we want to find a point which is allowed to be closer to (or farther from) $\bar{x}$ than $\operatorname{prox}_f(\bar{x})$ while still approximately minimizing $f$, it suffices to apply the proximal operator to a positive multiple of $f$. To see this, note that for any $\lambda \in \mathbb{R}_{++}$ and $\bar{x} \in \mathbb{E}$ we have

\[
\{\operatorname{prox}_{\lambda f}(\bar{x})\} = \operatorname*{arg\,min}_{x \in \mathbb{E}} \Bigl(\lambda f(x) + \tfrac{1}{2}\|x - \bar{x}\|_2^2\Bigr) = \operatorname*{arg\,min}_{x \in \mathbb{E}} \Bigl(f(x) + \tfrac{1}{2\lambda}\|x - \bar{x}\|_2^2\Bigr).
\]

Thus, if $\lambda \in \mathbb{R}_{++}$ is bigger than $1$, the point $\operatorname{prox}_{\lambda f}(\bar{x})$ is allowed to be farther from $\bar{x}$ than $\operatorname{prox}_f(\bar{x})$.

In the same way, if $\lambda \in \mathbb{R}_{++}$ is smaller than $1$, then $\operatorname{prox}_{\lambda f}(\bar{x})$ is probably closer to $\bar{x}$ than $\operatorname{prox}_f(\bar{x})$. The intuition on the proximal operator we have discussed is close to the problem of minimizing $f$. Thus, it comes as no surprise that some of the proximal operator's properties are closely related to the minimizers of $f$ on $\mathbb{E}$ (if any). Indeed, the following proposition$^4$ shows one (of the many) interesting properties of the proximal map: the fixed points of $\operatorname{prox}_f$ are exactly the minimizers of $f$.
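The effect of the scaling $\lambda$ can be seen concretely with $f = |\cdot|$ on the real line, whose scaled prox is the classical soft-thresholding operator: a larger $\lambda$ is allowed to move farther from $\bar{x}$ (here, closer to the minimizer $0$). A sketch with illustrative values:

```python
def prox_scaled_abs(xbar, lam):
    """prox_{lam * |.|}(xbar) = argmin_x lam*|x| + 0.5*(x - xbar)^2.

    The closed form is soft-thresholding: shrink xbar toward 0 by lam.
    """
    if xbar > lam:
        return xbar - lam
    if xbar < -lam:
        return xbar + lam
    return 0.0

# With xbar = 2: a bigger lam is allowed to move farther from xbar.
near = prox_scaled_abs(2.0, 0.5)   # 1.5
far = prox_scaled_abs(2.0, 1.5)    # 0.5
```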

Proposition 5.3.1. Let $f \colon \mathbb{E} \to (-\infty, +\infty]$ be a proper closed convex function and let $\bar{x} \in \mathbb{E}$. Then $\operatorname{prox}_f(\bar{x}) = \bar{x}$ if and only if $\bar{x}$ attains $\inf_{x \in \mathbb{E}} f(x)$.

Proof. Suppose $\bar{x}$ attains $\inf_{x \in \mathbb{E}} f(x)$. Then, for every $x \in \mathbb{E}$,
\[
f(x) + \tfrac{1}{2}\|x - \bar{x}\|_2^2 \geq f(x) \geq f(\bar{x}) = f(\bar{x}) + \tfrac{1}{2}\|\bar{x} - \bar{x}\|_2^2,
\]
so $\bar{x}$ attains the minimum defining $\operatorname{prox}_f(\bar{x})$, that is, $\operatorname{prox}_f(\bar{x}) = \bar{x}$.

Suppose now that $\operatorname{prox}_f(\bar{x}) = \bar{x}$ and define $h(x) := f(x) + \tfrac{1}{2}\|x - \bar{x}\|_2^2$ for every $x \in \mathbb{E}$. This way, we have that $\bar{x}$ attains $\inf_{x \in \mathbb{E}} h(x)$. Note that $x \in \mathbb{E} \mapsto \tfrac{1}{2}\|x - \bar{x}\|_2^2$ is a differentiable function with gradient $x - \bar{x}$ at $x \in \mathbb{E}$. Thus, by Theorems 3.5.4 and 3.5.5, for any $x \in \mathbb{E}$ we have
\[
\partial h(x) = \partial f(x) + x - \bar{x}.
\]
By the definition of subgradient, $\bar{x}$ attains $\inf_{x \in \mathbb{E}} h(x)$ if and only if $0 \in \partial h(\bar{x}) = \partial f(\bar{x}) + \bar{x} - \bar{x} = \partial f(\bar{x})$, that is, if and only if $\bar{x}$ attains $\inf_{x \in \mathbb{E}} f(x)$.
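The proposition is easy to check on a function with a known prox. For $f(x) = (x - a)^2$, first-order optimality of the prox problem gives a closed form, and the minimizer $a$ is its unique fixed point. A quick sketch ($a = 3$ is an arbitrary illustrative choice):

```python
def prox_quadratic(xbar, a=3.0):
    """prox_f(xbar) for f(x) = (x - a)^2.

    Setting the derivative of (x - a)^2 + 0.5*(x - xbar)^2 to zero gives
    2*(x - a) + (x - xbar) = 0, i.e. x = (2*a + xbar) / 3.
    """
    return (2.0 * a + xbar) / 3.0

fixed = prox_quadratic(3.0)   # the minimizer a = 3 is a fixed point
moved = prox_quadratic(0.0)   # any other point is pulled toward a, not fixed
```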

Since the idea behind $\operatorname{prox}_f$ is to (at least approximately) minimize $f$, it is natural to ask whether it can be applied to optimization. However, the proximal map itself is an optimization problem, and if it is not efficiently solvable, it is of no help in optimizing $f$ efficiently. Luckily, in many applications of proximal operators there are efficient ways to compute this operator. One example is when the function $f$ to be minimized can be written as a sum of two functions: one differentiable but hard to handle in the proximal map, and one non-smooth but with an easily computable proximal map. In this case, one handles the smooth function with a gradient step, and the non-smooth part by computing the proximal map at the current iterate. This line of thought is not our focus here, but one can find more information and related work in [57].
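This splitting idea is the proximal-gradient method. The sketch below minimizes $\tfrac{1}{2}(x - 2)^2 + \mu|x|$ in one dimension: a gradient step on the smooth part followed by the prox (soft-thresholding) of the non-smooth part. All constants are illustrative choices, not taken from [57]:

```python
def soft_threshold(v, lam):
    # prox of lam * |.|: shrink v toward 0 by lam
    return v - lam if v > lam else (v + lam if v < -lam else 0.0)

def proximal_gradient(x0, step, mu, iters=100):
    """Minimize 0.5*(x - 2)^2 + mu*|x| by proximal-gradient iterations.

    Each iteration: gradient step on the smooth part 0.5*(x - 2)^2,
    then the prox of (step * mu) * |.| applied to the result.
    """
    x = x0
    for _ in range(iters):
        x = soft_threshold(x - step * (x - 2.0), step * mu)
    return x

x_star = proximal_gradient(x0=0.0, step=0.5, mu=0.5)
# converges to 1.5, the solution of x - 2 + 0.5*sign(x) = 0 for x > 0
```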

Another idea to make the proximal map $\operatorname{prox}_f$ of a proper convex function $f \colon \mathbb{E} \to (-\infty, +\infty]$ at a given point $\bar{x} \in \mathbb{E}$ useful for optimizing $f$, even when we cannot compute $\operatorname{prox}_f$ efficiently, is to substitute for $f$ an approximation $\tilde{f}$ of $f$ which is easier to handle. If $f$ is subdifferentiable at $\bar{x}$ and $g \in \partial f(\bar{x})$, then an intuitive approximation candidate is given by $\tilde{f}(x) := f(\bar{x}) + \langle g, x - \bar{x} \rangle$ for every $x \in \mathbb{E}$. With this function $\tilde{f}$, we have

$^4$This proposition is not used in the remainder of the text, but it helps to build intuition about proximal operators.

\[
\{\operatorname{prox}_{\tilde{f}}(\bar{x})\} = \operatorname*{arg\,min}_{x \in \mathbb{E}} \Bigl(\langle g, x \rangle + \tfrac{1}{2}\|x - \bar{x}\|_2^2\Bigr),
\]
since $\tilde{f}$ differs from $x \mapsto \langle g, x \rangle$ only by a constant, which does not affect the set of minimizers.

By the definition of subgradient, $0$ must be a subgradient at $\operatorname{prox}_{\tilde{f}}(\bar{x})$ of the function being minimized above. Since $x \in \mathbb{E} \mapsto \langle g, x \rangle + \tfrac{1}{2}\|x - \bar{x}\|_2^2$ is a proper closed convex everywhere-differentiable function, by Theorem 3.5.5 we know that its only subgradient is its gradient. Thus, the gradient of $x \in \mathbb{E} \mapsto \langle g, x \rangle + \tfrac{1}{2}\|x - \bar{x}\|_2^2$ at $\operatorname{prox}_{\tilde{f}}(\bar{x})$ is $0$, that is,
\[
0 = g + \operatorname{prox}_{\tilde{f}}(\bar{x}) - \bar{x} \iff \operatorname{prox}_{\tilde{f}}(\bar{x}) = \bar{x} - g.
\]

Therefore, the proximal map in this case is just a step in the direction of the negative subgradient! Here we used a step size of $1$ for the sake of simplicity but, as we have discussed, one can simply scale $\tilde{f}$ to obtain a subgradient step with an arbitrary positive step size.
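The identity above, combined with the scaling trick, can be checked directly: the prox of $\lambda\tilde{f}$ at $\bar{x}$ is $\bar{x} - \lambda g$, i.e. a subgradient step with step size $\lambda$. A minimal sketch (the vectors and step size are illustrative):

```python
def prox_linearization(xbar, g, lam=1.0):
    """prox_{lam * f~}(xbar), where f~(x) = f(xbar) + <g, x - xbar>.

    As derived in the text, the minimizer of
    <g, x> + (1/(2*lam)) * ||x - xbar||_2^2 is exactly xbar - lam*g.
    """
    return [xi - lam * gi for xi, gi in zip(xbar, g)]

step = prox_linearization([1.0, 2.0], [0.5, -1.0], lam=0.2)
# step is [0.9, 2.2] (up to floating point): a subgradient step of size 0.2
```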

In the above discussion, we have seen how to view a subgradient step from a point $x$ for the function $f$ as the application of a proximal map based on a subgradient of $f$ at $x$. Moreover, in the previous section we have seen that the online subgradient descent algorithm is just a special case of online mirror descent. Thus, one may ask if there is any generalized proximal operator which generalizes many algorithms in a way similar to the way OMD generalizes many Online Convex Optimization algorithms. Indeed, the first appearance of the proximal operator, which is exactly the one we defined as $\operatorname{prox}_f$, uses the squared $\ell_2$-norm as a way to measure distances. One possible generalization replaces this norm with a distance-like function $D \colon \mathbb{E} \times \mathbb{E} \to (-\infty, +\infty]$. In this way, the form of this generalized proximal operator when applied to a linear function$^5$ $\langle g, \cdot \rangle$ at a point $\bar{x} \in \mathbb{E}$ is

\[
\operatorname*{arg\,min}_{x \in \mathbb{E}} \{\langle g, x \rangle + D(x, \bar{x})\}.
\]

The next theorem shows that each iterate of Adaptive OMD is given by a formula like the one above, with the minimization taking place over the set $X$ instead of the whole space. Additionally, the distance-like function used in the theorem is quite intuitive given what we have already seen about OMD: it is the Bregman divergence w.r.t.\ the mirror map of the current round. Moreover, the proof of the next theorem follows the lines of the discussion we had above about the subgradient step: we simply look at the optimality conditions of the generalized proximal step and conclude that the generalized proximal step is exactly the Adaptive OMD step.

Theorem 5.3.2. Let $C := (X, \mathcal{F})$ be an OCO instance such that $X$ is closed and such that each $f \in \mathcal{F}$ is a proper closed function which is subdifferentiable on $X$. Let $\mathcal{R} \colon \operatorname{Seq}(\mathcal{F}) \to (-\infty, +\infty]^{\mathbb{E}}$ be a mirror map strategy for $C$, let ENEMY be an enemy oracle for $C$, and let $T \in \mathbb{N}$. Define
\[
(x, f) := \mathrm{OCO}_C(\mathrm{AdaOMD}_X^{\mathcal{R}}, \mathrm{ENEMY}, T) \quad\text{and}\quad R_t := \sum_{i=1}^{t} \mathcal{R}(\langle f_1, \dots, f_{i-1} \rangle), \quad \text{for each } t \in [T].
\]

Moreover, let $g_t \in \partial f_t(x_t)$ be the same as in the definition of $\mathrm{AdaOMD}_X^{\mathcal{R}}(f)$ on Algorithm 5.1 for each $t \in [T]$. Set $\{x'_1\} := \operatorname*{arg\,min}_{x \in X} R_1(x)$ and define
\[
\{x'_{t+1}\} := \operatorname*{arg\,min}_{x \in X} \bigl(\langle g_t, x \rangle + B_{R_{t+1}}(x, x'_t)\bigr), \qquad \forall t \in \{1, \dots, T-1\}. \tag{5.6}
\]
Then $x_t = x'_t$ for each $t \in [T]$.

$^5$Recall that we saw in the discussion how the proximal operator applied to the linearized function yields the subgradient step, not the proximal operator applied to the function itself.

Proof. Let us prove the statement by induction on $t \in [T]$. For $t = 1$ the statement holds by the definitions of $x_1$ and $x'_1$. Let $t \in \{1, \dots, T-1\}$ and suppose $x_t = x'_t$. Define
\[
y_{t+1} := \nabla R_{t+1}(x_t) - g_t = \nabla R_{t+1}(x'_t) - g_t.
\]

By the definition of $\mathrm{AdaOMD}^{\mathcal{R}}$ on Algorithm 5.1, we have that $x_{t+1} = \Pi^{R_{t+1}}_X(\nabla R_{t+1}^*(y_{t+1}))$. Proposition 5.1.3 states that, since $R_{t+1}$ is a mirror map for $X$, then $R_{t+1}$ is differentiable on $\mathbb{E}$ and $R_{t+1}^*$ is differentiable at $y_{t+1}$ with $\nabla R_{t+1}^*(y_{t+1}) \in \operatorname{int}(\operatorname{dom} R_{t+1})$. Thus, since $(\operatorname{int}(\operatorname{dom} R_{t+1})) \cap \operatorname{ri}(X) \neq \emptyset$ by the definition of mirror map, Lemma 3.11.4 yields

\[
x_{t+1} = \Pi^{R_{t+1}}_X(\nabla R_{t+1}^*(y_{t+1})) \iff y_{t+1} - \nabla R_{t+1}(x_{t+1}) \in N_X(x_{t+1}). \tag{5.7}
\]
Note that

\[
y_{t+1} - \nabla R_{t+1}(x_{t+1}) = \nabla R_{t+1}(x'_t) - g_t - \nabla R_{t+1}(x_{t+1}) = -g_t - (\nabla B_{R_{t+1}}(\cdot, x'_t))(x_{t+1}),
\]
and the latter is minus the gradient of $x \in \mathbb{E} \mapsto \langle g_t, x \rangle + B_{R_{t+1}}(x, x'_t)$ at $x_{t+1}$. Therefore, using (5.7) with the above equation and by the optimality conditions from Theorem 3.6.2,

\[
\begin{aligned}
x_{t+1} = \Pi^{R_{t+1}}_X(\nabla R_{t+1}^*(y_{t+1})) &\iff -\bigl(g_t + (\nabla B_{R_{t+1}}(\cdot, x'_t))(x_{t+1})\bigr) \in N_X(x_{t+1})\\
&\iff x_{t+1} \in \operatorname*{arg\,min}_{x \in X} \bigl(\langle g_t, x \rangle + B_{R_{t+1}}(x, x'_t)\bigr).
\end{aligned}
\]
Since the latter set is the singleton $\{x'_{t+1}\}$ by (5.6), we conclude that $x_{t+1} = x'_{t+1}$.
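As a sanity check of the theorem (not part of the proof), one can verify numerically, for the entropic mirror map on the 2-simplex, that the closed-form OMD/EG step coincides with the generalized proximal step, the latter approximated here by a fine grid search. All numbers are illustrative:

```python
import math

eta = 1.0

def entropy_R(p):
    # R(x) = (1/eta) * sum_i x_i * ln(x_i), the (scaled) negative entropy
    return sum(v * math.log(v) for v in p) / eta

def bregman(p, q):
    # B_R(p, q) = R(p) - R(q) - <grad R(q), p - q>
    grad_q = [(1.0 + math.log(v)) / eta for v in q]
    return entropy_R(p) - entropy_R(q) - sum(
        gq * (a - b) for gq, a, b in zip(grad_q, p, q))

xt = [0.6, 0.4]    # current iterate on the simplex
g = [0.8, -0.2]    # illustrative subgradient

# Closed-form entropic OMD step (multiplicative update + normalization).
w = [v * math.exp(-eta * gi) for v, gi in zip(xt, g)]
omd = [v / sum(w) for v in w]

# Generalized proximal step: grid search for
# argmin over the simplex of <g, x> + B_R(x, xt).
n = 100000
best, best_val = None, float("inf")
for k in range(1, n):
    p = [k / n, 1.0 - k / n]
    val = g[0] * p[0] + g[1] * p[1] + bregman(p, xt)
    if val < best_val:
        best, best_val = p, val
# best agrees with omd up to the grid resolution
```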

This connection between Adaptive OMD and generalized proximal operators gives us yet another intuitive view of online mirror descent. At round $t$, the algorithm minimizes, as much as possible, the value of the linearized version of the function $f_{t-1}$ played by the enemy on the last round, while trying not to pick a point too far from the last iterate w.r.t.\ the Bregman divergence based on the current mirror map. In the next section, we will see that Adaptive OMD can be seen as an application of the Adaptive FTRL algorithm.