Norms - Online Convex Optimization: Algorithms, Learning, and Duality

On the Euclidean space $\mathbb{E}$, norms are functions which assign non-negative “lengths” or “sizes” to each point in $\mathbb{E}$, assigning length zero only to the zero vector. In the case of $\mathbb{R}^d$, one is most used to the Euclidean norm

$$x \in \mathbb{R}^d \mapsto \Big( \sum_{i=1}^d x_i^2 \Big)^{1/2}.$$

In this text we will be interested in cases where we use somewhat less standard norms. Thus, let us define which properties a function needs to satisfy to be a norm, and then let us see some concepts and results from convex-analytic duality applied to norms.

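Let $\|\cdot\| \colon \mathbb{E} \to \mathbb{R}$. We say that $\|\cdot\|$ is a norm on $\mathbb{E}$ if, for all $u, v \in \mathbb{E}$,

(i) $\|v\| \geq 0$, and equality holds if and only if $v = 0$,

(ii) $\|\alpha v\| = |\alpha| \|v\|$ for every $\alpha \in \mathbb{R}$,

(iii) $\|u + v\| \leq \|u\| + \|v\|$, also known as the triangle inequality.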

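If $\|\cdot\|$ satisfies only conditions (ii) and (iii) and $\|v\| \geq 0$ for any $v \in \mathbb{E}$, then $\|\cdot\|$ is a semi-norm on $\mathbb{E}$. Conditions (ii) and (iii) imply that semi-norms and norms are convex functions: for every $\lambda \in [0, 1]$ we have $\|\lambda u + (1 - \lambda) v\| \leq \|\lambda u\| + \|(1 - \lambda) v\| = \lambda \|u\| + (1 - \lambda) \|v\|$.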

Moreover, one may verify that if $\|\cdot\|$ is a norm on $\mathbb{E}$, then it is a continuous³ function on $\mathbb{E}$.

³ The definition of a continuous function itself (at least at first sight) depends on a norm, so saying that all norms are continuous without defining continuity is a somewhat circular statement. Thus, we use the following definition of continuity: a function $f \colon \mathbb{E} \to [-\infty, +\infty]$ is continuous at a point $\bar{x} \in \mathbb{E}$ if for every $\varepsilon > 0$ there is $\delta > 0$ such that, for any $x \in \mathbb{E}$ with $\langle \bar{x} - x, \bar{x} - x \rangle \leq \delta$, we have $|f(\bar{x}) - f(x)| \leq \varepsilon$. That is, we use the (squared) Euclidean norm on $\mathbb{E}$ to define continuity.

The Euclidean norm or $\ell_2$-norm on $\mathbb{E}$ is the norm $\|\cdot\|_2$ given by $\|x\|_2 := \sqrt{\langle x, x \rangle}$ for every $x \in \mathbb{E}$. Additionally, throughout the text we may use some special and well-known norms for $\mathbb{R}^d$:

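• the $\ell_1$-norm given by $\|x\|_1 := \sum_{i=1}^d |x_i|$ for every $x \in \mathbb{R}^d$,

• the $\ell_\infty$-norm given by $\|x\|_\infty := \max_{i \in [d]} |x_i|$ for every $x \in \mathbb{R}^d$,

• the $\ell_p$-norm for $p \in (1, \infty)$ given by $\|x\|_p := \big( \sum_{i=1}^d |x_i|^p \big)^{1/p}$ for every $x \in \mathbb{R}^d$.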
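The three families of norms listed above can be evaluated directly from their formulas. The following is a minimal sketch in plain Python; the function names `l1_norm`, `linf_norm`, and `lp_norm` are illustrative choices and do not come from the text.

```python
def l1_norm(x):
    """Sum of absolute values of the coordinates."""
    return sum(abs(xi) for xi in x)

def linf_norm(x):
    """Largest absolute value among the coordinates."""
    return max(abs(xi) for xi in x)

def lp_norm(x, p):
    """(sum_i |x_i|^p)^(1/p); p = 2 gives the Euclidean norm."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

x = [3.0, -4.0, 0.0]
print(l1_norm(x))     # 7.0
print(linf_norm(x))   # 4.0
print(lp_norm(x, 2))  # 5.0
```

On this example the values are ordered as $\|x\|_\infty \leq \|x\|_2 \leq \|x\|_1$.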

As we have already said, our main focus in this chapter is to define and gain intuition about duality relations and concepts in convex analysis. One very interesting class of dual objects related to norms is that of dual norms. If $\|\cdot\|$ is a norm on $\mathbb{E}$, the dual norm of $\|\cdot\|$ is the norm⁴ $\|\cdot\|_*$ on $\mathbb{E}$ defined by

$$\|x^*\|_* := \max\{ \langle x^*, x \rangle : x \in \mathbb{E}, \|x\| \leq 1 \}, \quad \forall x^* \in \mathbb{E}.$$

At first sight, the definition of the dual norm $\|\cdot\|_*$ of a norm $\|\cdot\|$ on $\mathbb{E}$ hardly has any intuitive meaning. There are two ways of looking at $\|\cdot\|_*$ which may be helpful in gaining some intuition. One way is to see dual norms as special cases of support functions. Set $B_{\|\cdot\|} := \{ x \in \mathbb{E} : \|x\| \leq 1 \}$, that is, $B_{\|\cdot\|}$ is the unit ball (w.r.t. the norm $\|\cdot\|$). Then, by the definition of conjugate function we have

$$\|x^*\|_* = \delta^*(x^* \mid B_{\|\cdot\|}), \quad \forall x^* \in \mathbb{E}.$$

This way of seeing the dual norm as the support function of the unit ball of the original norm may be useful in some cases, but it is arguably still too abstract. A more concrete way of seeing dual norms is as norms on the space of linear functionals on $\mathbb{E}$, that is, linear functions from $\mathbb{E}$ to $\mathbb{R}$. Let $x^* \in \mathbb{E}$ and fix a norm $\|\cdot\|$ on $\mathbb{E}$. The point $x^*$ can be seen as representing the linear functional $T_{x^*}$ given by $T_{x^*}(x) := \langle x^*, x \rangle$ for every $x \in \mathbb{E}$. Given that we have the norm $\|\cdot\|$ to measure the sizes of elements in $\mathbb{E}$, it would be interesting to have a related way to measure the “size” of $T_{x^*}$. Intuitively, we want a measure such that the bigger the value of $T_{x^*}(x)$ when compared to the norm of $x \in \mathbb{E}$, the bigger the size of $T_{x^*}$. That is, we want to measure how much $T_{x^*}$ stretches the vectors when we measure lengths with $\|\cdot\|$. Of course, for distinct non-zero vectors $x, y \in \mathbb{E}$ the ratios $T_{x^*}(x)/\|x\|$ and $T_{x^*}(y)/\|y\|$ may differ. Thus, we measure a linear functional by the direction which it stretches the most, that is,

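$$\sup_{x \in \mathbb{E} \setminus \{0\}} \frac{T_{x^*}(x)}{\|x\|} = \sup_{x \in \mathbb{E} \setminus \{0\}} \frac{\langle x^*, x \rangle}{\|x\|} = \sup_{x \in \mathbb{E} : \|x\| \leq 1} \langle x^*, x \rangle = \|x^*\|_*.$$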
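The "largest stretch" reading of the dual norm suggests a simple numerical experiment: sample directions $x$, compute the ratio $\langle x^*, x \rangle / \|x\|$, and keep the largest one. The sketch below does this for the $\ell_1$-norm; the helper `dual_norm_lower_bound`, the Gaussian sampling scheme, and the sample size are assumptions made for illustration. Sampling only yields a lower bound on the supremum, which for the $\ell_1$-norm can be at most $\max_i |x^*_i|$.

```python
import random

def l1_norm(x):
    return sum(abs(xi) for xi in x)

def dual_norm_lower_bound(x_star, norm, num_samples=2000, seed=0):
    """Largest sampled ratio <x_star, x> / norm(x); a lower bound on the dual norm."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(num_samples):
        x = [rng.gauss(0.0, 1.0) for _ in x_star]
        size = norm(x)
        if size == 0.0:
            continue  # the ratio is only defined for non-zero x
        ratio = sum(a * b for a, b in zip(x_star, x)) / size
        best = max(best, ratio)
    return best

x_star = [1.0, -3.0, 2.0]
estimate = dual_norm_lower_bound(x_star, l1_norm)
print(estimate)  # never larger than max_i |x_star_i| = 3.0 (up to rounding)
```

Random sampling is a crude way to approach the supremum and is used here only to make the ratio interpretation concrete.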

Let us look at some properties and interesting special cases of dual norms. One interesting fact that we will use repeatedly during the remainder of the text, usually without reference, is that the $\ell_2$-norm on $\mathbb{E}$ is self-dual, that is, we have $(\|\cdot\|_2)_* = \|\cdot\|_2$.

Lemma 3.8.1. The dual norm of $\|\cdot\|_2$ on $\mathbb{E}$ is $\|\cdot\|_2$.

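Proof. Let $\|\cdot\|_{2,*}$ be the norm dual to $\|\cdot\|_2$. By the Cauchy–Schwarz inequality we have, for any $x^* \in \mathbb{E} \setminus \{0\}$,

$$\|x^*\|_{2,*} = \max\{ (x^*)^T x : x \in \mathbb{E}, \|x\|_2 \leq 1 \} \leq \max\{ \|x^*\|_2 \|x\|_2 : x \in \mathbb{E}, \|x\|_2 \leq 1 \} = \|x^*\|_2,$$

and since the above inequality holds as an equation for $x := \|x^*\|_2^{-1} x^*$, we have $\|\cdot\|_{2,*} = \|\cdot\|_2$.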
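The maximizer used in the proof can be checked numerically. The sketch below, with illustrative helper names `l2_norm` and `inner` and an arbitrarily chosen vector `x_star`, normalizes $x^*$ and confirms that the resulting point lies on the Euclidean unit sphere and attains the value $\|x^*\|_2$.

```python
import math

def l2_norm(x):
    return math.sqrt(sum(xi * xi for xi in x))

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

x_star = [3.0, 4.0]
scale = l2_norm(x_star)                 # Euclidean norm of x_star
x_best = [xi / scale for xi in x_star]  # the point x := x_star / ||x_star||_2

print(l2_norm(x_best))        # approximately 1
print(inner(x_star, x_best))  # approximately ||x_star||_2
```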

⁴ We skip the proof that the dual norm is indeed a norm for the sake of conciseness.

As expected, we show in the next theorem that the dual norm of a dual norm is the original norm. Maybe more interestingly, the proof follows relatively easily when we use the results about Fenchel conjugates, mainly the fact that the conjugate of the conjugate of a closed convex function is the function itself. Since norms are continuous, and thus closed, convex functions, we are guaranteed that the conjugate of the conjugate of (half the square of) a norm is that function itself.

Theorem 3.8.2. Let $\|\cdot\|$ be a norm on $\mathbb{E}$. Then $(\tfrac{1}{2}\|\cdot\|^2)^* = \tfrac{1}{2}\|\cdot\|_*^2$. In particular, $\|\cdot\|_{**} = \|\cdot\|$.

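Proof. Let $x^* \in \mathbb{E}$. Note that

$$\big(\tfrac{1}{2}\|\cdot\|^2\big)^*(x^*) = \sup_{x \in \mathbb{E}} \big( \langle x^*, x \rangle - \tfrac{1}{2}\|x\|^2 \big) \leq \sup_{x \in \mathbb{E}} \big( \|x^*\|_* \|x\| - \tfrac{1}{2}\|x\|^2 \big) = \tfrac{1}{2}\|x^*\|_*^2, \qquad (3.8)$$

where in the last equation we used that $\sup_{\alpha \in \mathbb{R}} (\alpha \|x^*\|_* - \alpha^2/2)$ is attained by $\alpha = \|x^*\|_*$. Let $\bar{y} \in \mathbb{E}$ attain $\max\{ \langle x^*, x \rangle : x \in \mathbb{E}, \|x\| \leq 1 \} = \|x^*\|_*$, where we may take $\|\bar{y}\| = 1$, and set $\bar{x} := \|x^*\|_* \bar{y}$. We have

$$\langle x^*, \bar{x} \rangle - \tfrac{1}{2}\|\bar{x}\|^2 = \|x^*\|_* \langle x^*, \bar{y} \rangle - \tfrac{1}{2}\|x^*\|_*^2 \|\bar{y}\|^2 = \|x^*\|_*^2 - \tfrac{1}{2}\|x^*\|_*^2 \|\bar{y}\|^2 = \tfrac{1}{2}\|x^*\|_*^2.$$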

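Hence, (3.8) holds as an equation. Finally, since $\tfrac{1}{2}\|\cdot\|^2$ is continuous (and, thus, closed), by what we have just proved and by Theorem 3.4.2 we have

$$\tfrac{1}{2}\|\cdot\|_{**}^2 = \big(\tfrac{1}{2}\|\cdot\|_*^2\big)^* = \big(\tfrac{1}{2}\|\cdot\|^2\big)^{**} = \tfrac{1}{2}\|\cdot\|^2,$$

that is, $\|\cdot\|_{**} = \|\cdot\|$.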
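To make the conjugate formula concrete, the following sketch evaluates the objective $\langle x^*, x \rangle - \tfrac{1}{2}\|x\|^2$ for the $\ell_1$-norm on $\mathbb{R}^2$, whose dual the text identifies as the $\ell_\infty$-norm. The vector `x_star`, the grid, and the helper names are illustrative assumptions. The point `x_bar` is built as in the proof (the dual norm times a unit-ball maximizer), and a coarse grid search is used only as a sanity check that no grid point does better.

```python
import math

def l1_norm(x):
    return sum(abs(xi) for xi in x)

def linf_norm(x):
    return max(abs(xi) for xi in x)

def objective(x_star, x):
    """<x_star, x> - (1/2) * ||x||_1^2, the function maximized by the conjugate."""
    return sum(a * b for a, b in zip(x_star, x)) - 0.5 * l1_norm(x) ** 2

x_star = [1.0, -2.0]
dual = linf_norm(x_star)

# y_bar: signed basis vector at a coordinate of largest magnitude (l1 norm 1).
i_star = max(range(len(x_star)), key=lambda i: abs(x_star[i]))
y_bar = [0.0] * len(x_star)
y_bar[i_star] = math.copysign(1.0, x_star[i_star])
x_bar = [dual * yi for yi in y_bar]

grid = [i / 10 for i in range(-40, 41)]
grid_max = max(objective(x_star, [a, b]) for a in grid for b in grid)

print(objective(x_star, x_bar))  # approximately 0.5 * dual**2
print(grid_max)                  # not larger than 0.5 * dual**2 (up to rounding)
```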

One result that we will use extensively in this text is the fact that the $\ell_1$-norm and the $\ell_\infty$-norm are dual to each other.

Lemma 3.8.3. The dual norm of $\|\cdot\|_1$ on $\mathbb{R}^d$ is $\|\cdot\|_\infty$.

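Proof. Let $x^* \in \mathbb{R}^d$ and let $x \in \mathbb{R}^d$ be such that $\|x\|_1 \leq 1$. We have

$$(x^*)^T x = \sum_{i=1}^d x^*_i x_i \leq \sum_{i=1}^d |x^*_i| |x_i| \leq \|x^*\|_\infty \sum_{i=1}^d |x_i| \leq \|x^*\|_\infty.$$

Since the above chain of inequalities holds as an equation for $x := \operatorname{sign}(x^*_{i^*}) e_{i^*}$, where $i^* \in \arg\max_{i \in [d]} |x^*_i|$, we are done.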
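The proof is constructive: a maximizer over the $\ell_1$ unit ball is a signed standard basis vector at a coordinate of largest magnitude. A minimal sketch of that construction follows; the function name `dual_of_l1` and the example vector are illustrative.

```python
import math

def dual_of_l1(x_star):
    """Return (maximizer, value) of <x_star, x> over the l1 unit ball."""
    i_star = max(range(len(x_star)), key=lambda i: abs(x_star[i]))
    x = [0.0] * len(x_star)
    x[i_star] = math.copysign(1.0, x_star[i_star])  # signed basis vector
    value = sum(a * b for a, b in zip(x_star, x))
    return x, value

x_star = [1.0, -3.0, 2.0]
x, value = dual_of_l1(x_star)
print(x)      # [0.0, -1.0, 0.0]
print(value)  # 3.0, the l_infinity norm of x_star
```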

When we start to look at regret bounds for OCO algorithms, most of them will depend on the norms of the subgradients of the functions used by the enemy. Thus, one may already imagine that if the player is able to have any control over the norm which measures the sizes of the subgradients, she could pick a norm under which the subgradients of the enemy's functions have small norm. Still, to make such a choice, the player needs to have some information on the functions the enemy is allowed to pick. In optimization problems one usually assumes that the functions which one has to handle are Lipschitz continuous, that is, the functions cannot change too much between points which are close to each other (w.r.t. some fixed norm). For differentiable functions, Lipschitz continuity means that the derivative in any direction is bounded by a constant. Interestingly, Lipschitz continuity and the (dual) norms of the subgradients of the function are deeply connected. Before proving this result, let us define Lipschitz continuity.

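Let $\rho > 0$. A function $f \colon \mathbb{E} \to (-\infty, +\infty]$ is $\rho$-Lipschitz continuous on a set $X \subseteq \operatorname{dom} f$ w.r.t. a norm $\|\cdot\|$ on $\mathbb{E}$ if

$$|f(x) - f(y)| \leq \rho \|x - y\|, \quad \forall x, y \in X,$$

and when $X$ is not explicitly stated, assume $X = \operatorname{dom} f$.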
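As a small illustration of the definition, consider the linear function $f(x) = \langle g, x \rangle$. By the definition of the dual norm, $|f(x) - f(y)| = |\langle g, x - y \rangle| \leq \|g\|_* \|x - y\|$, so $f$ is $\|g\|_*$-Lipschitz continuous w.r.t. $\|\cdot\|$. The sketch below tests this inequality on sampled pairs for the $\ell_1$-norm, whose dual is the $\ell_\infty$-norm. The helper `is_lipschitz_on_pairs`, the vector `g`, and the sampling scheme are illustrative assumptions, and checking finitely many pairs can refute a Lipschitz constant but not prove one.

```python
import random

def l1_norm(x):
    return sum(abs(xi) for xi in x)

def is_lipschitz_on_pairs(f, rho, norm, pairs, tol=1e-9):
    """True if |f(x) - f(y)| <= rho * norm(x - y) holds on every given pair."""
    for x, y in pairs:
        diff = [a - b for a, b in zip(x, y)]
        if abs(f(x) - f(y)) > rho * norm(diff) + tol:
            return False
    return True

g = [1.0, -3.0, 2.0]

def f(x):
    return sum(gi * xi for gi, xi in zip(g, x))

rng = random.Random(0)
pairs = [([rng.uniform(-1, 1) for _ in g], [rng.uniform(-1, 1) for _ in g])
         for _ in range(1000)]
pairs.append(([0.0, -1.0, 0.0], [0.0, 0.0, 0.0]))  # pair where the bound is tight

rho = max(abs(gi) for gi in g)  # l_infinity norm of g
print(is_lipschitz_on_pairs(f, rho, l1_norm, pairs))        # True
print(is_lipschitz_on_pairs(f, 0.5 * rho, l1_norm, pairs))  # False
```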

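Theorem 3.8.4 (Based on [67, Lemma 2.6]). Let $X \subseteq \mathbb{E}$ be a convex set with nonempty interior, and let $f \colon \mathbb{E} \to (-\infty, +\infty]$ be a proper closed convex function which is $\rho$-Lipschitz continuous on $X$ w.r.t. a norm $\|\cdot\|$ on $\mathbb{E}$. Then, for every $x \in X$ there is $g \in \partial f(x)$ such that $\|g\|_* \leq \rho$. Additionally, for every $x \in \operatorname{int} X$ we have $\partial f(x) \subseteq \{ g \in \mathbb{E} : \|g\|_* \leq \rho \}$.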

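Proof. First, let us show that

$$\emptyset \neq \partial f(\mathring{x}) \subseteq \{ y \in \mathbb{E} : \|y\|_* \leq \rho \}, \quad \forall \mathring{x} \in \operatorname{int} X. \qquad (3.9)$$

Let $\mathring{x} \in \operatorname{int} X$ and let $u \in \partial f(\mathring{x})$, which exists by Theorem 3.5.1 since $X \subseteq \operatorname{dom} f$ and, thus, $\operatorname{int} X \subseteq \operatorname{int}(\operatorname{dom} f)$. Moreover, since the set $B := \{ v \in \mathbb{E} : \|v\| \leq 1 \}$ is compact⁵, $\sup_{v \in B} \langle v, u \rangle = \|u\|_*$ is attained. Let

$$y \in \mathring{x} + \arg\max_{v \in B} \langle u, v \rangle. \qquad (3.10)$$

Additionally, for every $\lambda \in [0, 1]$ define $z_\lambda := \mathring{x} + \lambda (y - \mathring{x})$. Since $\mathring{x} \in \operatorname{int} X$, there is $\varepsilon > 0$ such that $z_\varepsilon \in X$. Therefore,

$$\varepsilon \|u\|_* \overset{(3.10)}{=} \varepsilon \langle u, y - \mathring{x} \rangle = \langle u, z_\varepsilon - \mathring{x} \rangle \leq f(z_\varepsilon) - f(\mathring{x}) \leq \rho \|z_\varepsilon - \mathring{x}\| = \rho \varepsilon \|y - \mathring{x}\| \overset{(3.10)}{\leq} \rho \varepsilon,$$

where in the first inequality we just used the subgradient inequality. Hence, $\|u\|_* \leq \rho$. This completes the proof of (3.9).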

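Let $\bar{x} \in X$, let $\mathring{x} \in \operatorname{int} X$, and define

$$x_k := \mathring{x} + \Big( 1 - \frac{1}{k + 1} \Big) (\bar{x} - \mathring{x}), \quad \forall k \in \mathbb{N}.$$

By Theorem 3.2.1, we have $x_k \in \operatorname{int} X$ for every $k \in \mathbb{N}$. Thus, by (3.9), for every $k \in \mathbb{N}$ there is $u_k \in \partial f(x_k)$ with $\|u_k\|_* \leq \rho$. That is, $\{u_k\}_{k \in \mathbb{N}}$ is a bounded sequence and therefore it has a convergent subsequence. Namely, there is an increasing injection $\pi \colon \mathbb{N} \to \mathbb{N}$ such that $\lim_{k \to \infty} u_{\pi(k)} = \bar{u}$ for some $\bar{u} \in \mathbb{E}$ with $\|\bar{u}\|_* \leq \rho$. Moreover, since $f$ is closed, by Theorem 3.2.6 we have

$$\lim_{k \to \infty} f(x_{\pi(k)}) = f(\bar{x}). \qquad (3.11)$$

Finally, by the subgradient inequality we have, for every $k \in \mathbb{N}$ and $z \in \mathbb{E}$,

$$f(z) \geq f(x_{\pi(k)}) + \langle u_{\pi(k)}, z - x_{\pi(k)} \rangle.$$

Taking the limit for $k$ tending to $+\infty$ in the above inequality together with (3.11) (and since the inner product is a continuous function) yields

$$f(z) \geq f(\bar{x}) + \langle \bar{u}, z - \bar{x} \rangle, \quad \forall z \in \mathbb{E},$$

that is, $\bar{u} \in \partial f(\bar{x})$.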
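The theorem can be illustrated with $f = \|\cdot\|_1$ on $\mathbb{R}^d$. This function is 1-Lipschitz continuous w.r.t. the $\ell_1$-norm (by the triangle inequality) and $\sqrt{d}$-Lipschitz continuous w.r.t. the $\ell_2$-norm (by Cauchy–Schwarz), so its subgradients should have $\ell_\infty$-norm at most 1 and $\ell_2$-norm at most $\sqrt{d}$. The sketch below checks this for the coordinate-wise sign vector, which is one subgradient of the $\ell_1$-norm; the helper names and the test point are illustrative assumptions.

```python
import math
import random

def l1_norm(x):
    return sum(abs(xi) for xi in x)

def l1_subgradient(x):
    """One subgradient of the l1-norm at x: the coordinate-wise sign (0 at 0)."""
    return [float((xi > 0) - (xi < 0)) for xi in x]

x = [0.5, -2.0, 0.0]
g = l1_subgradient(x)
g_linf = max(abs(gi) for gi in g)           # dual norm paired with the l1-norm
g_l2 = math.sqrt(sum(gi * gi for gi in g))  # dual norm paired with the l2-norm

# Subgradient inequality f(z) >= f(x) + <g, z - x>, tested on random points z.
rng = random.Random(0)
gaps = []
for _ in range(1000):
    z = [rng.uniform(-3, 3) for _ in x]
    linearization = l1_norm(x) + sum(gi * (zi - xi) for gi, zi, xi in zip(g, z, x))
    gaps.append(l1_norm(z) - linearization)

print(g_linf <= 1.0)              # True
print(g_l2 <= math.sqrt(len(x)))  # True
print(min(gaps) >= -1e-9)         # True
```

The same subgradient has a different size under each dual norm, which is the freedom the player can exploit when choosing the norm.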

In Chapter 6, we will look at (semi-)norms on $\mathbb{R}^d$ which have a special form: they are the $\ell_2$-norm skewed by a positive semi-definite matrix $A \in \mathbb{S}^d_+$. Formally, for every $A \in \mathbb{S}^d_+$, define the (semi-)norm induced by $A$ by

$$\|x\|_A := \sqrt{x^T A x}, \quad \forall x \in \mathbb{R}^d.$$

The next lemma shows us that the functions that we have just defined are indeed semi-norms. Moreover, it shows that in the case of positive definite matrices, the above functions are indeed norms and that the dual norm of a norm induced by $A \in \mathbb{S}^d_{++}$ is the norm induced by the inverse matrix $A^{-1}$.

⁵ Recall that any norm on $\mathbb{E}$ is a continuous function.

Lemma 3.8.5. Let $A \in \mathbb{S}^d_+$. Then $\|\cdot\|_A$ is a semi-norm, and if $A \succ 0$, then $\|\cdot\|_A$ is actually a norm whose dual norm is $\|\cdot\|_{A^{-1}}$.

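Proof. Since $A \succeq 0$, by Proposition 1.1.4 we have that $A^{1/2} \in \mathbb{S}^d_+$ exists and is unique. Thus, $\|v\|_A = \|A^{1/2} v\|_2$ for any $v \in \mathbb{R}^d$. Since $v \in \mathbb{R}^d \mapsto A^{1/2} v$ is a linear function, it is clear that $v \in \mathbb{R}^d \mapsto \|A^{1/2} v\|_2$ is non-negative everywhere on $\mathbb{R}^d$ and that it satisfies properties (ii) and (iii) from the definition of norm. Suppose $A \succ 0$. Then $A^{1/2}$ is invertible, and for any $v \in \mathbb{R}^d$ we have $A^{1/2} v = 0$ if and only if $v = A^{-1/2} 0 = 0$. Thus, in this case, $\|\cdot\|_A$ is a norm on $\mathbb{R}^d$. Finally, by Proposition 1.1.4 we have $(A^{-1})^{1/2} = A^{-1/2} = (A^{1/2})^{-1}$. With this, note that for every $x \in \mathbb{R}^d$ there is $y := A^{1/2} x \in \mathbb{R}^d$ such that $x = A^{-1/2} y$. Thus, $\mathbb{R}^d = \{ A^{-1/2} y : y \in \mathbb{R}^d \}$. Therefore, by Theorem 3.8.2 and since the $\ell_2$-norm is dual to itself we have, for any $x^* \in \mathbb{R}^d$,

$$\begin{aligned}
\big(\tfrac{1}{2}\|\cdot\|_A^2\big)^*(x^*) &= \sup_{x \in \mathbb{R}^d} \big( (x^*)^T x - \tfrac{1}{2}\|x\|_A^2 \big) = \sup_{y \in \mathbb{R}^d} \big( (x^*)^T A^{-1/2} y - \tfrac{1}{2}\|A^{-1/2} y\|_A^2 \big) \\
&= \sup_{y \in \mathbb{R}^d} \big( (A^{-1/2} x^*)^T y - \tfrac{1}{2} y^T (A^{-1/2} A A^{-1/2}) y \big) \\
&= \sup_{y \in \mathbb{R}^d} \big( (A^{-1/2} x^*)^T y - \tfrac{1}{2} y^T y \big) \\
&= \sup_{y \in \mathbb{R}^d} \big( (A^{-1/2} x^*)^T y - \tfrac{1}{2}\|y\|_2^2 \big) \\
&\overset{\text{Thm 3.8.2}}{=} \tfrac{1}{2}\|A^{-1/2} x^*\|_2^2 = \tfrac{1}{2}\|x^*\|_{A^{-1}}^2.
\end{aligned}$$

Thus, by Theorem 3.8.2 we conclude that the norm dual to $\|\cdot\|_A$ is $\|\cdot\|_{A^{-1}}$.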
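The duality between $\|\cdot\|_A$ and $\|\cdot\|_{A^{-1}}$ can be checked on a small example. In the sketch below, the $2 \times 2$ matrix `A`, the vector `x_star`, and the helper names are illustrative assumptions. The candidate maximizer $x := A^{-1} x^* / \|x^*\|_{A^{-1}}$ is a choice made here (it does not appear in the text); one can verify by direct computation that it has $\|x\|_A = 1$ and $\langle x^*, x \rangle = \|x^*\|_{A^{-1}}$.

```python
import math

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def mat_vec(M, v):
    return [inner(row, v) for row in M]

def induced_norm(M, v):
    """sqrt(v^T M v) for a symmetric positive definite matrix M."""
    return math.sqrt(inner(v, mat_vec(M, v)))

def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[2.0, 1.0], [1.0, 2.0]]  # positive definite: eigenvalues 1 and 3
A_inv = inverse_2x2(A)
x_star = [1.0, -1.0]

dual_value = induced_norm(A_inv, x_star)  # ||x_star||_{A^{-1}}
x_best = [t / dual_value for t in mat_vec(A_inv, x_star)]

print(induced_norm(A, x_best))  # approximately 1
print(inner(x_star, x_best))    # approximately dual_value
print(dual_value)               # approximately sqrt(2)
```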

Finally, at some points of the text we shall need some norms for the space of real square matrices. Among the better-known norms for matrices are the operator norms, which are based on norms for $\mathbb{R}^d$. In this text we shall restrict our attention only to the operator norm induced by the $\ell_2$-norm. Formally, the operator norm (induced by the $\ell_2$-norm) of $A \in \mathbb{R}^{d \times d}$ is

$$\|A\|_2 := \max\{ \|A x\|_2 : x \in \mathbb{R}^d, \|x\|_2 \leq 1 \}.$$

The next lemma shows useful connections between the operator norm of a matrix and its eigenvalues. We skip its proof for the sake of conciseness.

Lemma 3.8.6 ([39, Example 5.6.6]). If $A \in \mathbb{S}^d$, then $\|A\|_2 = \max\{ |\lambda^\uparrow_1(A)|, |\lambda^\uparrow_d(A)| \}$. In particular, if $A \succeq 0$, then $\|A\|_2 \leq \operatorname{Tr}(A)$.
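A numerical check of the lemma for $2 \times 2$ symmetric matrices is sketched below, assuming that $\lambda^\uparrow_1(A)$ and $\lambda^\uparrow_d(A)$ denote the smallest and largest eigenvalues of $A$. The eigenvalues come from the closed-form expression for a symmetric $2 \times 2$ matrix, and the operator norm is approximated by scanning unit vectors on the circle, which gives a lower bound that is tight up to the angular resolution. The helper names and the two example matrices are illustrative.

```python
import math

def sym_eigenvalues_2x2(a, b, c):
    """Eigenvalues (smallest, largest) of the symmetric matrix [[a, b], [b, c]]."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean - radius, mean + radius

def operator_norm_2x2(a, b, c, num_angles=10000):
    """Approximate max ||A x||_2 over unit vectors x by scanning the unit circle."""
    best = 0.0
    for k in range(num_angles):
        t = 2.0 * math.pi * k / num_angles
        x0, x1 = math.cos(t), math.sin(t)
        best = max(best, math.hypot(a * x0 + b * x1, b * x0 + c * x1))
    return best

# Indefinite example: the eigenvalue of largest magnitude is the negative one.
lo, hi = sym_eigenvalues_2x2(1.0, 2.0, -2.0)
print(max(abs(lo), abs(hi)), operator_norm_2x2(1.0, 2.0, -2.0))  # both close to 3

# Positive definite example: the operator norm is bounded by the trace.
lo_pd, hi_pd = sym_eigenvalues_2x2(2.0, 1.0, 2.0)
trace = 2.0 + 2.0
print(hi_pd <= trace, operator_norm_2x2(2.0, 1.0, 2.0) <= trace)  # True True
```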

From the document Online Convex Optimization: Algorithms, Learning, and Duality (pages 64-68)