
OMD Connection to Proximal Operators

and let $g_t \in \partial f_t(x_t)$ be as in the definition of $\mathrm{EOMD}_X^R(f)$ for each $t \in [T]$. Then $x_1 = d^{-1}\mathbf{1}$ and
\[
x_{t+1}(i) = \frac{1}{\omega_{t+1}}\, x_t(i)\, e^{-\eta g_t(i)}, \quad\text{where}\quad \omega_{t+1} := \sum_{j=1}^{d} x_t(j)\, e^{-\eta g_t(j)}, \qquad \forall i \in [d],\ \forall t \in \{1, \dots, T-1\}.
\]

Proof. By Lemma 5.2.3, we know that $R$ is a mirror map for $\Delta_d$. Thus, it only remains to prove the form of the points $x_t$ for $t \in [T]$. First of all, by Proposition 3.11.5 we have
\[
\{d^{-1}\mathbf{1}\} = \operatorname*{arg\,min}_{x \in \Delta_d} B_R(x, \mathbf{1}) = \operatorname*{arg\,min}_{x \in \Delta_d} \bigl(R(x) + \|x\|_1 - \|\mathbf{1}\|_1\bigr) = \operatorname*{arg\,min}_{x \in \Delta_d} \bigl(R(x) + 1 - d\bigr) = \operatorname*{arg\,min}_{x \in \Delta_d} R(x),
\]
and since $x_1 \in \operatorname*{arg\,min}_{x \in \Delta_d} R(x)$ by definition, we conclude that $x_1 = d^{-1}\mathbf{1}$. Let $y_t \in \mathbb{R}^d$ be as in the definition of $\mathrm{EOMD}_X^R(f)$ for each $t \in [T]$. For every $t \in \{1, \dots, T-1\}$ we have, by the definition of the algorithm, that

\[
y_{t+1}(i) = \nabla R(x_t)(i) - g_t(i) = \frac{1}{\eta}\bigl(1 + \ln x_t(i)\bigr) - g_t(i), \qquad \forall i \in [d].
\]
By Proposition 3.4.4 we have $R^*(z) = \frac{1}{\eta} \sum_{i=1}^{d} e^{\eta z(i) - 1}$ for every $z \in \mathbb{R}^d$, which implies
\[
\nabla R^*(z)(i) = e^{\eta z(i) - 1}, \qquad \forall i \in [d],\ \forall z \in \mathbb{R}^d.
\]
Therefore, for every $t \in \{1, \dots, T-1\}$,
\[
\nabla R^*(y_{t+1})(i) = \exp\bigl(\eta y_{t+1}(i) - 1\bigr) = \exp\bigl(1 + \ln x_t(i) - \eta g_t(i) - 1\bigr) = x_t(i)\, e^{-\eta g_t(i)}, \qquad \forall i \in [d].
\]
Since $x_{t+1} = \Pi^R_{\Delta_d}(\nabla R^*(y_{t+1}))$, and since by Proposition 3.11.5 the Bregman projection onto $\Delta_d$ w.r.t.\ $R$ boils down to a normalization w.r.t.\ the $\ell_1$-norm, we are done.
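The closed-form update just derived is simple to state in code. Below is a minimal Python sketch of one exponentiated-gradient (entropic OMD) step on the simplex; the step size `eta` and the subgradient `g1` are illustrative values, not taken from the text:

```python
import math

def eg_step(x, g, eta):
    """One exponentiated-gradient (entropic OMD) step on the simplex.

    Implements the closed form from the proposition:
    x_{t+1}(i) = x_t(i) * exp(-eta * g_t(i)) / omega_{t+1},
    where omega_{t+1} = sum_j x_t(j) * exp(-eta * g_t(j)).
    """
    w = [xi * math.exp(-eta * gi) for xi, gi in zip(x, g)]
    omega = sum(w)
    return [wi / omega for wi in w]

d = 4
x1 = [1.0 / d] * d            # x_1 = d^{-1} * 1: the uniform distribution
g1 = [0.3, -0.1, 0.5, 0.0]    # illustrative subgradient of the round's loss
x2 = eg_step(x1, g1, eta=0.5)
# x2 stays on the simplex, with mass shifted toward low-loss coordinates
```

Note that the iterate never leaves the relative interior of the simplex: the multiplicative update keeps every coordinate strictly positive, and the normalization restores unit $\ell_1$-norm.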

The update rules derived from the mirror maps in the above propositions are simple to implement and well studied in the optimization literature. Thus, one may hope to unify the convergence proofs of many existing algorithms with a general convergence proof of Online Mirror Descent. Indeed, Adaptive OMD happens to have many interesting connections with proximal operators and with Adaptive FTRL. The connection with the former allows us to write the general update rule of AdaOMD as a closed formula which sheds light on its inner workings. The connection with Adaptive FTRL gives yet another perspective on how AdaOMD works, and yields regret bounds through the use of the Strong FTRL Lemma.

It follows from Lemma 3.9.14 that the above set is indeed a singleton. The name proximal comes from the fact that if we use $f := \delta(\cdot \mid X)$ for some closed set $X \subseteq \mathbb{E}$, the point given by the proximal operator for $\bar{x} \in \mathbb{E}$ is exactly (one of) the closest point(s) to $\bar{x}$ in $X$ (w.r.t.\ the $\ell_2$-norm).
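As a concrete instance of this fact, for the indicator of a box $[\ell, u]^d$ the proximal operator is just coordinatewise clipping. A small sketch (the box bounds and input vector are illustrative choices):

```python
def prox_box_indicator(xbar, lo, hi):
    """prox of f = delta(. | [lo, hi]^d): the Euclidean projection onto the box.

    Minimizing delta(x | X) + 0.5 * ||x - xbar||_2^2 forces x into X and then
    picks the closest such point, which for a box is clipping per coordinate.
    """
    return [min(max(v, lo), hi) for v in xbar]

p = prox_box_indicator([2.0, -3.0, 0.5], 0.0, 1.0)
# p == [1.0, 0.0, 0.5]: each coordinate is the closest point in [0, 1]
```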

If we look more carefully at (5.5), we can see a clear intuition for the inner workings of $\operatorname{prox}_f(\bar{x})$: it tries to find a point that approximately minimizes $f$ without going too far away from $\bar{x}$. Interestingly, if we want to find a point which is allowed to be closer to (or farther from) $\bar{x}$ than $\operatorname{prox}_f(\bar{x})$ while still approximately minimizing $f$, it suffices to apply the proximal operator to a positive multiple of $f$. To see this, note that for any $\lambda \in \mathbb{R}_{++}$ and $\bar{x} \in \mathbb{E}$ we have

\[
\{\operatorname{prox}_{\lambda f}(\bar{x})\} = \operatorname*{arg\,min}_{x \in \mathbb{E}} \Bigl(\lambda f(x) + \tfrac{1}{2}\|x - \bar{x}\|_2^2\Bigr) = \operatorname*{arg\,min}_{x \in \mathbb{E}} \Bigl(f(x) + \tfrac{1}{2\lambda}\|x - \bar{x}\|_2^2\Bigr).
\]

Thus, if $\lambda \in \mathbb{R}_{++}$ is bigger than $1$, the point $\operatorname{prox}_{\lambda f}(\bar{x})$ is allowed to be farther from $\bar{x}$ than $\operatorname{prox}_f(\bar{x})$.

In the same way, if $\lambda \in \mathbb{R}_{++}$ is smaller than $1$, then $\operatorname{prox}_{\lambda f}(\bar{x})$ is probably closer to $\bar{x}$ than $\operatorname{prox}_f(\bar{x})$. The intuition on the proximal operator we have discussed is close to the problem of minimizing $f$. Thus, it comes as no surprise that some of the proximal operator's properties are closely related to the minimizers of $f$ on $\mathbb{E}$ (if any). Indeed, the following proposition$^4$ shows one (of the many) interesting properties of the proximal map: the fixed points of $\operatorname{prox}_f$ are exactly the minimizers of $f$.
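The effect of the scaling $\lambda$ can be seen concretely with $f = |\cdot|$ on the real line, whose scaled prox is the classical soft-thresholding operator: a larger $\lambda$ is allowed to move farther from $\bar{x}$ (here, closer to the minimizer $0$). A sketch with illustrative values:

```python
def prox_scaled_abs(xbar, lam):
    """prox_{lam * |.|}(xbar) = argmin_x lam*|x| + 0.5*(x - xbar)^2.

    The closed form is soft-thresholding: shrink xbar toward 0 by lam.
    """
    if xbar > lam:
        return xbar - lam
    if xbar < -lam:
        return xbar + lam
    return 0.0

# With xbar = 2: a bigger lam is allowed to move farther from xbar.
near = prox_scaled_abs(2.0, 0.5)   # 1.5
far = prox_scaled_abs(2.0, 1.5)    # 0.5
```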

Proposition 5.3.1. Let $f \colon \mathbb{E} \to (-\infty, +\infty]$ be a proper closed convex function and let $\bar{x} \in \mathbb{E}$. Then $\operatorname{prox}_f(\bar{x}) = \bar{x}$ if and only if $\bar{x}$ attains $\inf_{x \in \mathbb{E}} f(x)$.

Proof. Suppose $\bar{x}$ attains $\inf_{x \in \mathbb{E}} f(x)$. Then, for every $x \in \mathbb{E}$,
\[
f(x) + \tfrac{1}{2}\|x - \bar{x}\|_2^2 \geq f(x) \geq f(\bar{x}) = f(\bar{x}) + \tfrac{1}{2}\|\bar{x} - \bar{x}\|_2^2,
\]
so $\bar{x}$ attains the minimum defining $\operatorname{prox}_f(\bar{x})$, that is, $\operatorname{prox}_f(\bar{x}) = \bar{x}$.

Suppose now that $\operatorname{prox}_f(\bar{x}) = \bar{x}$ and define $h(x) := f(x) + \tfrac{1}{2}\|x - \bar{x}\|_2^2$ for every $x \in \mathbb{E}$. This way, we have that $\bar{x}$ attains $\inf_{x \in \mathbb{E}} h(x)$. Note that $x \in \mathbb{E} \mapsto \tfrac{1}{2}\|x - \bar{x}\|_2^2$ is a differentiable function with gradient $x - \bar{x}$ at $x \in \mathbb{E}$. Thus, by Theorems 3.5.4 and 3.5.5, for any $x \in \mathbb{E}$ we have
\[
\partial h(x) = \partial f(x) + x - \bar{x}.
\]
By the definition of subgradient, $\bar{x}$ attains $\inf_{x \in \mathbb{E}} h(x)$ if and only if $0 \in \partial h(\bar{x}) = \partial f(\bar{x}) + \bar{x} - \bar{x} = \partial f(\bar{x})$, that is, if and only if $\bar{x}$ attains $\inf_{x \in \mathbb{E}} f(x)$.
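The proposition is easy to check on a function with a known prox. For $f(x) = (x - a)^2$, first-order optimality of the prox problem gives a closed form, and the minimizer $a$ is its unique fixed point. A quick sketch ($a = 3$ is an arbitrary illustrative choice):

```python
def prox_quadratic(xbar, a=3.0):
    """prox_f(xbar) for f(x) = (x - a)^2.

    Setting the derivative of (x - a)^2 + 0.5*(x - xbar)^2 to zero gives
    2*(x - a) + (x - xbar) = 0, i.e. x = (2*a + xbar) / 3.
    """
    return (2.0 * a + xbar) / 3.0

fixed = prox_quadratic(3.0)   # the minimizer a = 3 is a fixed point
moved = prox_quadratic(0.0)   # any other point is pulled toward a, not fixed
```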

Since the idea behind $\operatorname{prox}_f$ is to (at least approximately) minimize $f$, it is natural to ask whether it can be applied to optimization. However, the proximal map itself is an optimization problem, and if it is not efficiently solvable, it is of no help in optimizing $f$ efficiently. Luckily, in many applications of proximal operators there are efficient ways to compute this operator. One example is when the function $f$ to be minimized can be written as a sum of two functions: one differentiable but hard to handle in the proximal map, and one non-smooth but with an easily computable proximal map. In this case, one handles the smooth function with a gradient step, and the non-smooth part by computing the proximal map at the current iterate. This line of thought is not our focus here, but one can find more information and related work in [57].
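This splitting idea is the proximal-gradient method. The sketch below minimizes $\tfrac{1}{2}(x - 2)^2 + \mu|x|$ in one dimension: a gradient step on the smooth part followed by the prox (soft-thresholding) of the non-smooth part. All constants are illustrative choices, not taken from [57]:

```python
def soft_threshold(v, lam):
    # prox of lam * |.|: shrink v toward 0 by lam
    return v - lam if v > lam else (v + lam if v < -lam else 0.0)

def proximal_gradient(x0, step, mu, iters=100):
    """Minimize 0.5*(x - 2)^2 + mu*|x| by proximal-gradient iterations.

    Each iteration: gradient step on the smooth part 0.5*(x - 2)^2,
    then the prox of (step * mu) * |.| applied to the result.
    """
    x = x0
    for _ in range(iters):
        x = soft_threshold(x - step * (x - 2.0), step * mu)
    return x

x_star = proximal_gradient(x0=0.0, step=0.5, mu=0.5)
# converges to 1.5, the solution of x - 2 + 0.5*sign(x) = 0 for x > 0
```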

Another idea to make the proximal map $\operatorname{prox}_f$ of a proper convex function $f \colon \mathbb{E} \to (-\infty, +\infty]$ at a given point $\bar{x} \in \mathbb{E}$ useful for optimizing $f$, even when we cannot compute $\operatorname{prox}_f$ efficiently, is to substitute for $f$ an approximation $\tilde{f}$ of $f$ which is easier to handle. If $f$ is subdifferentiable at $\bar{x}$ and $g \in \partial f(\bar{x})$, then an intuitive approximation candidate is given by $\tilde{f}(x) := f(\bar{x}) + \langle g, x - \bar{x} \rangle$ for every $x \in \mathbb{E}$. With this function $\tilde{f}$, we have

$^4$This proposition is not used in the remainder of the text, but it helps to build intuition about proximal operators.

\[
\{\operatorname{prox}_{\tilde{f}}(\bar{x})\} = \operatorname*{arg\,min}_{x \in \mathbb{E}} \Bigl(\langle g, x \rangle + \tfrac{1}{2}\|x - \bar{x}\|_2^2\Bigr),
\]
since $\tilde{f}$ differs from $x \mapsto \langle g, x \rangle$ only by a constant, which does not affect the set of minimizers.

By the definition of subgradient, $0$ must be a subgradient at $\operatorname{prox}_{\tilde{f}}(\bar{x})$ of the function being minimized above. Since $x \in \mathbb{E} \mapsto \langle g, x \rangle + \tfrac{1}{2}\|x - \bar{x}\|_2^2$ is a proper closed convex everywhere-differentiable function, by Theorem 3.5.5 we know that its only subgradient is its gradient. Thus, the gradient of $x \in \mathbb{E} \mapsto \langle g, x \rangle + \tfrac{1}{2}\|x - \bar{x}\|_2^2$ at $\operatorname{prox}_{\tilde{f}}(\bar{x})$ is $0$, that is,
\[
0 = g + \operatorname{prox}_{\tilde{f}}(\bar{x}) - \bar{x} \iff \operatorname{prox}_{\tilde{f}}(\bar{x}) = \bar{x} - g.
\]

Therefore, the proximal map in this case is just a step in the direction of the negative subgradient! Here we used a step size of $1$ for the sake of simplicity but, as we have discussed, one can simply scale $\tilde{f}$ to obtain a subgradient step with an arbitrary positive step size.
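The identity above, combined with the scaling trick, can be checked directly: the prox of $\lambda\tilde{f}$ at $\bar{x}$ is $\bar{x} - \lambda g$, i.e. a subgradient step with step size $\lambda$. A minimal sketch (the vectors and step size are illustrative):

```python
def prox_linearization(xbar, g, lam=1.0):
    """prox_{lam * f~}(xbar), where f~(x) = f(xbar) + <g, x - xbar>.

    As derived in the text, the minimizer of
    <g, x> + (1/(2*lam)) * ||x - xbar||_2^2 is exactly xbar - lam*g.
    """
    return [xi - lam * gi for xi, gi in zip(xbar, g)]

step = prox_linearization([1.0, 2.0], [0.5, -1.0], lam=0.2)
# step is [0.9, 2.2] (up to floating point): a subgradient step of size 0.2
```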

In the above discussion, we have seen how to view a subgradient step from a point $x$ for the function $f$ as the application of a proximal map based on a subgradient of $f$ at $x$. Moreover, in the previous section we have seen that the online subgradient descent algorithm is just a special case of online mirror descent. Thus, one may ask if there is any generalized proximal operator which generalizes many algorithms in a way similar to the way OMD generalizes many Online Convex Optimization algorithms. Indeed, the first appearance of the proximal operator, which is exactly the one we defined as $\operatorname{prox}_f$, uses the squared $\ell_2$-norm as a way to measure distances. One possible generalization replaces this norm with a distance-like function $D \colon \mathbb{E} \times \mathbb{E} \to (-\infty, +\infty]$. In this way, the form of this generalized proximal operator when applied to a linear function$^5$ $\langle g, \cdot \rangle$ at a point $\bar{x} \in \mathbb{E}$ is

\[
\operatorname*{arg\,min}_{x \in \mathbb{E}} \{\langle g, x \rangle + D(x, \bar{x})\}.
\]

The next theorem shows that each iterate of Adaptive OMD is given by a formula like the one above, with the minimization taking place over the set $X$ instead of the whole space. Additionally, the distance-like function used in the theorem is quite intuitive given what we have already seen about OMD: it is the Bregman divergence w.r.t.\ the mirror map of the current round. Moreover, the proof of the next theorem follows the lines of the discussion we had above about the subgradient step: we simply look at the optimality conditions of the generalized proximal step and conclude that the generalized proximal step is exactly the Adaptive OMD step.

Theorem 5.3.2. Let $C := (X, \mathcal{F})$ be an OCO instance such that $X$ is closed and such that each $f \in \mathcal{F}$ is a proper closed function which is subdifferentiable on $X$. Let $\mathcal{R} \colon \operatorname{Seq}(\mathcal{F}) \to (-\infty, +\infty]^{\mathbb{E}}$ be a mirror map strategy for $C$, let ENEMY be an enemy oracle for $C$, and let $T \in \mathbb{N}$. Define
\[
(x, f) := \mathrm{OCO}_C(\mathrm{AdaOMD}_X^{\mathcal{R}}, \mathrm{ENEMY}, T) \quad\text{and}\quad R_t := \sum_{i=1}^{t} \mathcal{R}(\langle f_1, \dots, f_{i-1} \rangle), \quad \text{for each } t \in [T].
\]

Moreover, let $g_t \in \partial f_t(x_t)$ be the same as in the definition of $\mathrm{AdaOMD}_X^{\mathcal{R}}(f)$ on Algorithm 5.1 for each $t \in [T]$. Set $\{x'_1\} := \operatorname*{arg\,min}_{x \in X} R_1(x)$ and define
\[
\{x'_{t+1}\} := \operatorname*{arg\,min}_{x \in X} \bigl(\langle g_t, x \rangle + B_{R_{t+1}}(x, x'_t)\bigr), \qquad \forall t \in \{1, \dots, T-1\}. \tag{5.6}
\]
Then $x_t = x'_t$ for each $t \in [T]$.

$^5$Recall that we saw in the discussion how the proximal operator applied to the linearized function yields the subgradient step, not the proximal operator applied to the function itself.

Proof. Let us prove the statement by induction on $t \in [T]$. For $t = 1$ the statement holds by the definitions of $x_1$ and $x'_1$. Let $t \in \{1, \dots, T-1\}$ and suppose $x_t = x'_t$. Define
\[
y_{t+1} := \nabla R_{t+1}(x_t) - g_t = \nabla R_{t+1}(x'_t) - g_t.
\]

By the definition of $\mathrm{AdaOMD}^{\mathcal{R}}$ on Algorithm 5.1, we have that $x_{t+1} = \Pi^{R_{t+1}}_X(\nabla R_{t+1}^*(y_{t+1}))$. Proposition 5.1.3 states that, since $R_{t+1}$ is a mirror map for $X$, then $R_{t+1}$ is differentiable on $\mathbb{E}$ and $R_{t+1}^*$ is differentiable at $y_{t+1}$ with $\nabla R_{t+1}^*(y_{t+1}) \in \operatorname{int}(\operatorname{dom} R_{t+1})$. Thus, since $(\operatorname{int}(\operatorname{dom} R_{t+1})) \cap \operatorname{ri}(X) \neq \emptyset$ by the definition of mirror map, Lemma 3.11.4 yields

\[
x_{t+1} = \Pi^{R_{t+1}}_X(\nabla R_{t+1}^*(y_{t+1})) \iff y_{t+1} - \nabla R_{t+1}(x_{t+1}) \in N_X(x_{t+1}). \tag{5.7}
\]
Note that

\[
y_{t+1} - \nabla R_{t+1}(x_{t+1}) = \nabla R_{t+1}(x'_t) - g_t - \nabla R_{t+1}(x_{t+1}) = -g_t - (\nabla B_{R_{t+1}}(\cdot, x'_t))(x_{t+1}),
\]
and the latter is minus the gradient of $x \in \mathbb{E} \mapsto \langle g_t, x \rangle + B_{R_{t+1}}(x, x'_t)$ at $x_{t+1}$. Therefore, using (5.7) with the above equation and by the optimality conditions from Theorem 3.6.2,

\[
\begin{aligned}
x_{t+1} = \Pi^{R_{t+1}}_X(\nabla R_{t+1}^*(y_{t+1})) &\iff -\bigl(g_t + (\nabla B_{R_{t+1}}(\cdot, x'_t))(x_{t+1})\bigr) \in N_X(x_{t+1})\\
&\iff x_{t+1} \in \operatorname*{arg\,min}_{x \in X} \bigl(\langle g_t, x \rangle + B_{R_{t+1}}(x, x'_t)\bigr).
\end{aligned}
\]
Since the latter set is the singleton $\{x'_{t+1}\}$ by (5.6), we conclude that $x_{t+1} = x'_{t+1}$.
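As a sanity check of the theorem (not part of the proof), one can verify numerically, for the entropic mirror map on the 2-simplex, that the closed-form OMD/EG step coincides with the generalized proximal step, the latter approximated here by a fine grid search. All numbers are illustrative:

```python
import math

eta = 1.0

def entropy_R(p):
    # R(x) = (1/eta) * sum_i x_i * ln(x_i), the (scaled) negative entropy
    return sum(v * math.log(v) for v in p) / eta

def bregman(p, q):
    # B_R(p, q) = R(p) - R(q) - <grad R(q), p - q>
    grad_q = [(1.0 + math.log(v)) / eta for v in q]
    return entropy_R(p) - entropy_R(q) - sum(
        gq * (a - b) for gq, a, b in zip(grad_q, p, q))

xt = [0.6, 0.4]    # current iterate on the simplex
g = [0.8, -0.2]    # illustrative subgradient

# Closed-form entropic OMD step (multiplicative update + normalization).
w = [v * math.exp(-eta * gi) for v, gi in zip(xt, g)]
omd = [v / sum(w) for v in w]

# Generalized proximal step: grid search for
# argmin over the simplex of <g, x> + B_R(x, xt).
n = 100000
best, best_val = None, float("inf")
for k in range(1, n):
    p = [k / n, 1.0 - k / n]
    val = g[0] * p[0] + g[1] * p[1] + bregman(p, xt)
    if val < best_val:
        best, best_val = p, val
# best agrees with omd up to the grid resolution
```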

This connection between Adaptive OMD and generalized proximal operators gives us yet another intuitive view of online mirror descent. At round $t$, the algorithm minimizes, as much as possible, the value of the linearized version of the function $f_{t-1}$ played by the enemy on the last round, while trying not to pick a point too far from the last iterate w.r.t.\ the Bregman divergence based on the current mirror map. In the next section, we will see that Adaptive OMD can be seen as an application of the Adaptive FTRL algorithm.