
4.3.3 Distance-based parametrisation for metric learning

The other possibility for defining the weights is to express them directly as a function of the distance rather than of the rank. This has the advantage that the weights depend smoothly on the distance, which is crucial if the distance is to be adjusted during training.

The weights of training images $j$ for an image $i$ are therefore redefined as
\[
\pi_{ij} = \frac{\exp\bigl(-d_\theta(i,j)\bigr)}{\sum_{j'} \exp\bigl(-d_\theta(i,j')\bigr)}, \tag{4.30}
\]
where $d_\theta$ is a distance metric with parameters $\theta$ that we want to optimise. Note that the weights $\pi_{ij}$ decay exponentially with the distance $d_\theta$ to image $i$.
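As a concrete illustration, the following minimal NumPy sketch evaluates Equation 4.30, assuming the learnt distances $d_\theta(i,j)$ have already been collected into an $N \times N$ array; the function and variable names are illustrative, not taken from the thesis implementation.

```python
import numpy as np

def neighbour_weights(dist):
    """Weights of Eq. 4.30: pi[i, j] = exp(-d_theta(i, j)) / sum_j' exp(-d_theta(i, j')).

    dist : (N, N) array of learnt distances d_theta(i, j); the diagonal is excluded,
           since an image is not used as its own neighbour.
    """
    logits = -dist.astype(float)
    np.fill_diagonal(logits, -np.inf)            # exclude self-neighbours
    logits -= logits.max(axis=1, keepdims=True)  # stabilise the exponentials
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)      # each row sums to one
```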

Choices for $d_\theta$ include, but are not limited to, the following:

(a) A fixed distance $d$ with a positive scale factor: $d_\lambda(i,j) = \lambda\, d_{ij}$,

(b) A positive combination of several base distances: $d_{\mathbf{w}}(i,j) = \mathbf{w}^\top \mathbf{d}_{ij}$, where $\mathbf{d}_{ij}$ is a vector of base distances between images $i$ and $j$, and $\mathbf{w}$ contains the positive coefficients of the linear distance combination,

(c) A Mahalanobis distance $d_M$ parametrised by a positive semi-definite matrix $M$, as defined in Equation 2.1:
\[
d_M(i,j) = (\mathbf{x}_i - \mathbf{x}_j)^\top M (\mathbf{x}_i - \mathbf{x}_j), \tag{4.31}
\]
where $\mathbf{x}_i$ and $\mathbf{x}_j$ are feature vectors for images $i$ and $j$, respectively.

The Mahalanobis distance is the most general case of the three options above: if the base distances are Euclidean distances, then the single-distance case corresponds to $M = \lambda I$, and the case of multiple base distances can be written with a block-scalar matrix $M$ whose diagonal blocks are the components of $\mathbf{w}$ times identity matrices.

Let us first focus on the simple case (a). In this setting there is only one parameter, $\lambda$, which controls how fast the weights decay with the distance. The gradient of $L$ with respect to $\lambda$ equals:

\[
\frac{\partial L}{\partial \lambda}
= \sum_{i,w} \frac{c_{y_{iw}}}{p(y_{iw})} \sum_{j} \Bigl( d_{ij}\,\pi_{ij}\, p(y_{iw}) - d_{ij}\,\pi_{ij}\, p(y_{iw}\mid j) \Bigr), \tag{4.32}
\]
\[
= \sum_{i,j} C_i \bigl( \pi_{ij} - \rho_{ij} \bigr)\, d_{ij}, \tag{4.33}
\]
where $C_i = \sum_w c_{y_{iw}}$ and $\rho_{ij}$ denotes the weighted average over all words $w$ of the posterior probability of neighbour $j$ for image $i$ given the annotation:
\[
\rho_{ij} = \sum_{w} \frac{c_{y_{iw}}}{C_i} \, \frac{\pi_{ij}\, p(y_{iw}\mid j)}{p(y_{iw})}
          = \sum_{w} \frac{c_{y_{iw}}}{C_i} \, p(j\mid y_{iw}). \tag{4.34}
\]
We refer to this variant as DT, for "distance-based".
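For concreteness, here is a small NumPy sketch of the DT gradient of Equation 4.33; it takes the weights $\pi_{ij}$, the posteriors $\rho_{ij}$ of Equation 4.34 and the counts $C_i$ as pre-computed inputs, and all names are illustrative assumptions rather than the thesis code.

```python
import numpy as np

def grad_lambda(pi, rho, C, base_dist):
    """Gradient of the log-likelihood w.r.t. the scale lambda (Eq. 4.33):
        dL/dlambda = sum_{i,j} C_i * (pi_ij - rho_ij) * d_ij

    pi, rho   : (N, N) arrays of weights pi_ij and posteriors rho_ij (Eq. 4.34)
    C         : (N,) array with C_i = sum_w c_{y_iw}
    base_dist : (N, N) array of fixed base distances d_ij
    """
    return np.sum(C[:, None] * (pi - rho) * base_dist)
```

This scalar would then drive a gradient-based update of $\lambda$, kept positive; the choice of optimiser is not specified in this passage.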

In the second option (b), the number of parameters equals the number of base distances that are combined. This is a straightforward extension of DT, and the gradient is obtained by replacing the scalar $d_{ij}$ in Equation 4.33 with the vector $\mathbf{d}_{ij}$ of base distances:

\[
\nabla L(\mathbf{w}) = \sum_{i,j} C_i \bigl( \pi_{ij} - \rho_{ij} \bigr)\, \mathbf{d}_{ij}. \tag{4.35}
\]

We refer to this variant as MD, for “multiple distances”. Equivalently, DT is the special case of MD when there is only one base distance.
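Equation 4.35 admits the same kind of sketch, with the $M$ base distances stacked along a third array axis; the names are again illustrative assumptions.

```python
import numpy as np

def grad_w(pi, rho, C, base_dists):
    """Gradient of the log-likelihood w.r.t. the combination weights w (Eq. 4.35):
        dL/dw = sum_{i,j} C_i * (pi_ij - rho_ij) * d_ij,  d_ij being a vector of M base distances.

    pi, rho    : (N, N) arrays
    C          : (N,) array
    base_dists : (N, N, M) array, base_dists[i, j, m] = m-th base distance between images i and j
    """
    coeff = C[:, None] * (pi - rho)                    # (N, N)
    return np.einsum('ij,ijm->m', coeff, base_dists)   # (M,) gradient vector
```

Since the coefficients of $\mathbf{w}$ are constrained to be positive, one simple option is a projected gradient step that clips negative components to zero; this is an assumption made for the sketch, not a detail stated in this passage.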

Finally, using option (c), we can decompose $d_M(i,j)$ as in Equation 4.36 (and already given in Equation 2.2). This shows that the distance can be written as a linear combination of the components of $(\mathbf{x}_i - \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^\top$, which can be seen as individual distance values. Therefore we can re-use Equation 4.35 to obtain the gradient of the likelihood in Equation 4.37:

\[
d_M(\mathbf{x}_i, \mathbf{x}_j) = \sum_{k=1}^{D} \sum_{l=1}^{D} M_{kl}\, (x_{ik} - x_{jk})(x_{il} - x_{jl}), \tag{4.36}
\]
\[
\nabla L(M) = \sum_{i,j} C_i \bigl( \pi_{ij} - \rho_{ij} \bigr) (\mathbf{x}_i - \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^\top. \tag{4.37}
\]

Using derivations similar to those for LDML in Equations 2.31 to 2.36, we can write the gradient in the following form:

\[
\nabla L(M) = X (H + H^\top) X^\top, \tag{4.38}
\]
where
\[
X = [\mathbf{x}_i] \in \mathbb{R}^{D \times N}, \tag{4.39}
\]
\[
h_{ii} = \sum_{j} C_i \bigl( \rho_{ij} - \pi_{ij} \bigr), \tag{4.40}
\]
\[
h_{ij} = C_i \bigl( \pi_{ij} - \rho_{ij} \bigr) \quad \text{for } i \neq j, \tag{4.41}
\]
\[
H = [h_{ij}] \in \mathbb{R}^{N \times N}. \tag{4.42}
\]


Again, as shown for LDML in Section 2.3.1, we can replace the optimisation of $M$ with the optimisation of a projection matrix $U$ such that $M = U^\top U$:

\[
\nabla L(U) = 2\, U X (H + H^\top) X^\top. \tag{4.43}
\]
This expression is promising for learning compact image representations for optimal tag prediction using the Euclidean distance in our weighted nearest neighbour model, using a low-rank matrix $U \in \mathbb{R}^{d \times D}$. Notably, most indexing and approximate nearest neighbour techniques have been proposed for Euclidean spaces. Equation 4.43 also shows that, similarly to LDML and many other Mahalanobis metric learning algorithms, TagProp for learning a Mahalanobis distance is kernelisable.
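For option (c), the sketch below evaluates the gradient of Equation 4.37 directly from the feature matrix, avoiding the explicit sum of $N^2$ outer products, and derives the low-rank gradient of Equation 4.43 from it; the function and variable names are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def grad_M(pi, rho, C, X):
    """Gradient of the log-likelihood w.r.t. the Mahalanobis matrix M (Eq. 4.37):
        dL/dM = sum_{i,j} C_i (pi_ij - rho_ij) (x_i - x_j)(x_i - x_j)^T
    computed without materialising the N^2 outer products.

    pi, rho : (N, N) arrays of weights and posteriors
    C       : (N,) array with C_i = sum_w c_{y_iw}
    X       : (D, N) matrix whose columns are the feature vectors x_i
    """
    A = C[:, None] * (pi - rho)          # a_ij = C_i (pi_ij - rho_ij)
    # sum_ij a_ij (x_i - x_j)(x_i - x_j)^T = X (diag(row sums) + diag(col sums) - A - A^T) X^T
    B = np.diag(A.sum(axis=1) + A.sum(axis=0)) - A - A.T
    return X @ B @ X.T                   # (D, D), symmetric

def grad_U(pi, rho, C, X, U):
    """Gradient w.r.t. the low-rank factor U with M = U^T U (Eq. 4.43)."""
    return 2.0 * U @ grad_M(pi, rho, C, X)
```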

In the following, we focus on the DT and MD variants. The first reason is that these variants have only a limited number of parameters, one for each base distance. In our experiments, we have $M = 15$ image representations which account for a total of $D = 32752$ dimensions. In the case of a general Mahalanobis distance, the number of parameters would therefore be very large: a symmetric $D \times D$ matrix has $D(D+1)/2 \approx 5 \cdot 10^8$ free parameters. Since our model is a non-linear nearest neighbour classifier, it is already very flexible in its predictions with only a few parameters. Second, computations can be performed very efficiently in the case of MD, by pre-computing the $M$ pairwise distance matrices.

To further reduce the computational cost of training the model, we do not compute all pairwise $\pi_{ij}$ and $\rho_{ij}$. Rather, for each $i$ we compute them only over a large set of $K$ neighbours, and assume the remaining $\pi_{ij}$ and $\rho_{ij}$ to be zero. When only one distance is used, we simply select the $K$ nearest neighbours, since they do not change with the scaling parameter.
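For the single-distance case, this truncation can be sketched as follows; only the $K$ smallest distances per row are kept, and the weights of Equation 4.30 are then renormalised over that neighbourhood (illustrative names, not the thesis code).

```python
import numpy as np

def knn_indices(base_dist, K):
    """Indices of the K nearest neighbours of each image under a fixed distance.

    base_dist : (N, N) array; the diagonal is ignored (an image is not its own neighbour).
    Returns an (N, K) array of neighbour indices (unordered within the K nearest).
    """
    d = base_dist.copy()
    np.fill_diagonal(d, np.inf)
    return np.argpartition(d, K, axis=1)[:, :K]
```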

When learning a linear combination of several distances it is not clear beforehand which will be the nearest neighbours, as the distance measure changes during learning. Given that we will use $K$ neighbours, we therefore include as many neighbours as possible from each base distance so as to maximise the chance to include all images with large $\pi_{ij}$ regardless of the distance combination $\mathbf{w}$ that is learnt. After determining these neighbourhoods, our algorithm scales linearly with the number of training images.

For each image, we select the $K$ neighbours in the following way. Let us denote by $\mathcal{N}_d^k$ the neighbourhood of size $k$ for distance $d$. Since the neighbourhoods of size $k$ for the different distances are not disjoint, their union will typically have fewer than $M \times k$ elements. We therefore look for the smallest $k$ such that the union of the $M$ neighbourhoods of size $k$ has cardinality at least $K$:

\[
k^{\star} = \operatorname*{arg\,min}_{k} \Bigl\{ \Bigl| \bigcup_{d} \mathcal{N}_d^k \Bigr| \geq K \Bigr\}. \tag{4.44}
\]

Figure 4.3: Illustration of the $M \times K$ nearest neighbour matrix of an image, containing at cell $(d,k)$ the index $n_{dk}$ of the $k$-th nearest neighbour for distance $d$. To select $K$ neighbours for our tag prediction algorithm, we propose to select the $K$ first unique values found in this matrix when running through it column-wise.

Alternatively, $k^{\star}$ can be understood as the maximum $k = \min_d\{k_d\}$, where $k_d$ is the largest neighbour rank for which neighbours 1 to $k_d$ of base distance $d$ are all included among the selected neighbours.

For a given data point and pre-computed neighbours for each base distance, there is an efficient algorithm to perform this selection in linear time. The basic idea is to go through the neighbourhoods, rank by rank, while keeping track of which neighbours have already been selected, until $K$ unique neighbours are found. For that, it is sufficient to pre-compute the $K$ nearest neighbours for each of the $M$ distances in an $M \times K$ matrix and to go through it column-wise, as illustrated in Figure 4.3. In practice we use a variation of this idea that also keeps track of the lowest rank (among the distances) at which each neighbour is found, so that it is easy to process the distances one after the other in an online fashion, with the same overall complexity.
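The basic column-wise variant described above (not the online variation used in practice) can be sketched as follows; `neigh` is an illustrative name for the pre-computed matrix of neighbour indices.

```python
def select_unique_neighbours(neigh, K):
    """Select the first K unique neighbour indices found when scanning the
    M x K matrix `neigh` column-wise, i.e. rank by rank across all distances.

    neigh : list of M lists (or an (M, K) array); neigh[d][k] is the index of
            the (k+1)-th nearest neighbour of the image under base distance d.
    """
    selected, seen = [], set()
    n_ranks = len(neigh[0])
    for k in range(n_ranks):                 # go through ranks (columns)...
        for row in neigh:                    # ...across all M distances (rows)
            j = row[k]
            if j not in seen:
                seen.add(j)
                selected.append(j)
                if len(selected) == K:
                    return selected
    return selected                          # fewer than K unique neighbours available
```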

Finally, note the relation of our model to the multi-class metric learning approach of Globerson and Roweis [2005]. In that work, a metric is learnt such that the weights $\pi_{ij}$ as defined by Equation 4.30 are as close as possible, in the sense of the Kullback-Leibler (KL) divergence, to a fixed set of target weights $\rho_{ij}$. The target weights were defined to be zero for pairs from different classes, and set to a constant for all pairs from the same class. In fact, when deriving an EM algorithm for our model, we find the objective of the M-step to be of the form of a KL divergence between the $\rho_{ij}$ (fixed to the values computed in the E-step) and the $\pi_{ij}$. For fixed $\rho_{ij}$ this KL divergence is convex in $\mathbf{w}$.


Figure 4.4: Illustration of a potential weakness of weighted nearest neighbour models. The star represents the image to be annotated, surrounded by images tagged with a red square and others tagged with a blue circle. Even if most of the blue circles are included in the neighbourhood of the target image (represented as a circle centred on the star), the prediction for the blue-circle tag will always be lower than the prediction for the red-square tag, since red squares are densely present in the entire space.