Nonlinear Separators, Support Vector Machines, and Ker- Ker-nelsKer-nels

It may happen that there are linear separators for which almost all but a small fraction of examples are on the correct side. Going back to (6.2), ask if there is aw for which at least (1−ε)n of then inequalities in (6.2) are satisfied. Unfortunately, such problems are NP-hard and there are no good algorithms to solve them. A good way to think about this is that we suffer a “loss” of one for each misclassified point and would like to minimize the loss. But this loss function is discontinuous, it goes from 0 to 1 abruptly. However, with a nicer loss function it is possible to solve the problem. One possibility is to introduce slack variablesyi,i= 1,2, . . . , n, where yi measures how badly the exampleai is classified. We then include the slack variables in the objective function to be minimized:

minimize |v|²+c

i=1

y_i

subject to (v·ai)li ≥1−yi

y_i ≥0.

i= 1,2, . . . , n (6.3) If for somei, l_i(v·a_i)≥1, then sety_i to its lowest value, namely zero, since eachy_i has a positive coefficient in the cost function. If, however,l_i(v·a_i)<1, then sety_i = 1−l_i(v·a_i), so y_i is just the amount of violation of this inequality. Thus, the objective function is trying to minimize a combination of the total violation as well as 1/margin. It is easy to see that this is the same as minimizing

|v|²+cX

(1−l_i(v·a_i))⁺,⁹

subject to the constraints. The second term is the sum of the violations.

6.3 Nonlinear Separators, Support Vector Machines, and

+1 +1 +1

-1 -1

Figure 6.4: The checker board pattern.

polynomial. Note that each d-tuple of integers (i₁, i₂, . . . , i_d) with i₁+i₂+· · ·+i_d ≤ D leads to a distinct monomial, xⁱ₁¹xⁱ₂²· · ·xⁱ_d^d. So, the number of monomials in the polyno-mialpis at most the number of ways of insertingd-1 dividers into a sequence ofD+d−1 positions which is ^D+d−1_d−1

≤(D+d−1)^d−1. Letm = (D+d−1)^d−1 be the upper bound on the number of monomials.

By letting the coefficients of the monomials be unknowns, we can formulate a linear program in m variables whose solution gives the required polynomial. Indeed, suppose the polynomial pis

p(x₁, x₂, . . . , x_d) = X

i1,i2,...,id

i1+i2+···+i_d≤D

w_i₁_,i₂_,...,i_dxⁱ₁¹xⁱ₂²· · ·xⁱ_d^d.

Then the statement p(a_i) > 0 (recall a_i is a d-vector) is just a linear inequality in the w_i₁_,i₂_,...,i_d. One may try to solve such a linear program. But the exponential number of variables for even moderate values of D makes this approach infeasible. Neverthe-less, this theoretical approach is useful. First, we clarify the discussion above with an example. Suppose d = 2 and D = 2. Then the possible (i₁, i₂) form the set {(1,0),(0,1),(2,0),(1,1),(0,2)}. We ought to include the pair (0,0); but it is con-venient to have a separate constant term calledb. So we write

p(x1, x2) =b+w10x1 +w01x2+w11x1x2+w20x²₁+w02x²₂. Each example, a_i,is a 2-vector, (a_i1, a_i2). The linear program is

b+w_1,0a_i1+w₀₁a_i2+w₁₁a_i1a_i2+w₂₀a²_i1 +w₀₂a²_i2 >0 if label ofi= +1 b+w₁₀a_i1+w₀₁a_i2+w₁₁a_i1a_i2 +w₂₀a²_i1+w₀₂a²_i2 <0 if label of i=−1.

Note that we “pre-compute” a_i1a_i2, so this does not cause a nonlinearity. The linear inequalities have unknowns that are thew’s and b.

The approach above can be thought of as embedding the examples a_i that are in d-space into a m-dimensional space where there is one coordinate for each i1, i2, . . . , id

summing to at mostD, except for (0,0, . . . ,0), and ifa_i = (x₁, x₂, . . . , x_d), the coordinate is xⁱ₁¹xⁱ₂²· · ·xⁱ_d^d. Call this embedding ϕ(x). When d = D = 2, as in the above example, ϕ(x) = (x₁, x₂, x²₁, x₁x₂, x²₂). If d= 3 andD= 2,

ϕ(x) = (x₁, x₂, x₃, x₁x₂, x₁x₃, x₂x₃, x²₁, x²₂, x²₃),

and so on. We then try to find a m-dimensional vector w such that the dot product of wand ϕ(a_i) is positive if the label is +1 and negative otherwise. Note that thiswis not necessarily theϕ of some vector in d space.

Instead of finding any w, we want to find the w maximizing the margin. As earlier, write this program as

min|w|² subject to (w·ϕ(a_i))l_i ≥1 for all i.

The major question is whether we can avoid having to explicitly compute the embedding ϕ and the vector w. Indeed, we only need to haveϕ and w implicitly. This is based on the simple, but crucial observation that any optimal solution w to the convex program above is a linear combination of the ϕ(ai). If w = P

yiϕ(ai), then w^Tϕ(aj) can be computed without actually knowing theϕ(ai) but only the products ϕ(ai)·ϕ(aj).

Lemma 6.2 Any optimal solutionw to the convex program above is a linear combination of the ϕ(a_i).

Proof: If w has a component perpendicular to all the ϕ(a_i), simply zero out that com-ponent. This preserves all the inequalities since thew·ϕ(a_i) do not change and decreases

|w|² contradicting the assumption that w is an optimal solution.

Assume that w is a linear combination of the ϕ(a_i). Say w= P

y_iϕ(a_i), where the y_i are real variables. Note that then

|w|² = X

y_iϕ(a_i)

· X

y_jϕ(a_j)

i,j

y_iy_jϕ(a_i)·ϕ(a_j).

Reformulate the convex program as minimize X

i,j

y_iy_jϕ(a_i)·ϕ(a_j) (6.4)

subject tol_i X

y_jϕ(a_j)·ϕ(a_i)

≥1 ∀i. (6.5)

It is important to notice that ϕ itself is not needed, only the dot products of ϕ(a_i) and ϕ(a_j) for all i and j including i=j. The Kernel matrix K defined as

kij =ϕ(ai)·ϕ(aj),

suffices since we can rewrite the convex program as minimizeX

y_iy_jk_ij subject to l_iX

k_ijy_j ≥1. (6.6) This convex program is called asupport vector machine (SVM) though it is really not a machine. The advantage is thatK has onlyn² entries instead of theO(d^D) entries in each ϕ(a_i). Instead of specifyingϕ(a_i), we specify how to get K from thea_i. The specification is usually in closed form. For example, the “Gaussian kernel” is given by:

k_ij =ϕ(a_i)·ϕ(a_j) =e^−c|aⁱ^−a^j^|². We prove shortly that this is indeed a kernel function.

First, an important question arises. Given a matrix K, such as the above matrix for the Gaussian kernel, how do we know that it arises from an embeddingϕas the pairwise dot products of the ϕ(a_i)? This is answered in the following lemma.

Lemma 6.3 A matrix K is a kernel matrix, i.e., there is an embedding ϕ such that k_ij =ϕ(a_i)·ϕ(a_j), if and only if K is positive semidefinite.

Proof: IfK is positive semidefinite, then it can be expressed as K =BB^T. Defineϕ(a_i) to be the i^th row of B. Then kij = ϕ(ai)^Tϕ(aj). Conversely, if there is an embedding ϕ such that k_ij = ϕ(a_i)^Tϕ(a_j), then using the ϕ(a_i) for the rows of a matrix B, we have that K =BB^T and soK is positive semidefinite.

Recall that a function of the form P

yiyjkij = y^TKy is convex if and only if K is positive semidefinite. So the support vector machine problem is a convex program. We may use any positive semidefinite matrix as our kernel matrix.

We now give an important example of a kernel matrix. Consider a set of vectors a₁,a₂, . . . and letk_ij = (a_i·a_j)^p, where p is a positive integer. We prove that the matrix K with elements k_ij is positive semidefinite. Suppose u is any n-vector. We must show that u^TKu=P

k_iju_iu_j ≥0.

kijuiuj =X

uiuj(ai·aj)^p

u_iu_j X

a_ika_jk

u_iu_j



 X

k1,k2,...,kp

a_ik₁a_ik₂· · ·a_ik_pa_jk₁· · ·a_jk_p



 by expansion¹¹.

Note thatk₁, k₂, . . . , k_pneed not be distinct. Exchanging the summations and simplifying X

u_iu_j



 X

k1,k2,...,kp

a_ik₁a_ik₂· · ·a_ik_pa_jk₁· · ·a_jk_p



= X

k1,k2,...,kp

u_iu_ja_ik₁a_ik₂· · ·a_ik_pa_jk₁· · ·a_jk_p

= X

k1,k2,...,kp

u_ia_ik₁a_ik₂· · ·a_ik_p

. The last term is a sum of squares and thus nonnegative proving thatK is positive semidef-inite.

From this, it is easy to see that for any set of vectors a₁,a₂, . . . and any c₁, c₂, . . . greater than or equal to zero, the matrixK wherek_ij has an absolutely convergent power series expansionkij =

∞

p=0

cp(ai·aj)^p is positive semidefinite. For any u,

u^TKu=X

u_ik_iju_j =X

u_i

∞

p=0

c_p(a_i·a_j)^p

! u_j =

∞

p=0

c_pX

i,j

u_iu_j(a_i·a_j)^p ≥0.

Lemma 6.4 For any set of vectors a₁,a₂, . . ., the matrix K given by k_ij =e^−|aⁱ^−a^j^|²^/(2σ²⁾ is positive semidefinite for any value ofσ.

Proof:

e^−|aⁱ^−a^j^|²^/2σ² =e^−|aⁱ^|²^/2σ²e^−|a^j^|²^/2σ²e^aⁱ^T^a^j^/σ² =

e^−|aⁱ^|²^/2σ²e^−|a^j^|²^/2σ² X^∞

t=0

(a_i^Ta_j)^t t!σ^2t

The matrixL given by l_ij =

∞

t=0

_(a

iTaj)^t t!σ^2t

is positive semidefinite. Now K can be written as DLD^T, where D is the diagonal matrix with e^−|aⁱ^|²^/2σ² as its (i, i)^th entry. So K is positive semidefinite.

Example: (Use of the Gaussian Kernel) Consider a situation where examples are points in the plane on two juxtaposed curves, the solid curve and the dotted curve shown in Figure 6.5, where points on the first curve are labeled +1 and points on the second curve are labeled -1. Suppose examples are spacedδapart on each curve and the minimum distance between the two curves is ∆ >> δ. Clearly, there is no half-space in the plane that classifies the examples correctly. Since the curves intertwine a lot, intuitively, any polynomial which classifies them correctly must be of high degree. Consider the Gaussian kernel e^−|aⁱ^−a^j^|²^/δ². For this kernel, the K has kij ≈ 1 for adjacent points on the same

11Here a_ikdenotes thek^th coordinate of the vectora_i

Figure 6.5: Two curves

curve and kij ≈ 0 for all other pairs of points. Reorder the examples, first listing in order all examples on the solid curve, then on the dotted curve. K has the block form:

K =

K₁ 0 0 K₂

, where K₁ and K₂ are both roughly the same size and are both block matrices with 1’s on the diagonal and slightly smaller constants on the diagonals one off from the main diagonal and then exponentially falling off with distance from the diagonal.

The SVM is easily seen to be essentially of the form:

minimize y₁^TK₁y₁+y₂^TK₂y₂

subject to K₁y₁ ≥1 and K₂y₂ ≤ −1.

This separates into two programs, one for y₁ and the other for y₂. From the fact that K₁ =K₂, the solution will have y₂ =−y₁. Further by the structure which is essentially the same everywhere except at the ends of the curves, the entries iny₁will all be essentially the same as will the entries in y₂. Thus, the entries in y₁ will be 1 everywhere and the entries in y₂ will be -1 everywhere. Letl_i be the±1 labels for the points. The y_i values provide a nice simple classifier: l_iy_i >1.

No documento Foundations of Data Science (páginas 188-193)