Skip-Gram - PhD Dissertation - Department of Mathematics

This relation holds for any given $\delta^{[s]}_j$ with $1 \le s \le K$ by iteration. The procedure of calculating the $\delta$'s through the previously calculated $\delta$'s is called back-propagation.

The take-away is that we may compute $\delta_j^{[s]}$ through the previously calculated $\delta$'s (by iterating this procedure backwards) and obtain the desired derivatives for gradient descent through the formula

$$\frac{\partial L}{\partial w^{[s]}_{ij}} = \frac{\partial L}{\partial h^{[s]}_i} \frac{\partial h^{[s]}_i}{\partial w^{[s]}_{ij}} = \delta^{[s]}_i z_j^{[s-1]},$$

where $w^{[s]}_{ij}$ denotes the $(i, j)$'th entry of $W^{[s]}$ in the $s$th layer.
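As a numerical illustration of this formula, the following minimal sketch checks $\delta_i z_j$ against a finite-difference estimate. It assumes a single linear layer $h = Wz$ with squared-error loss $L = \frac{1}{2}\|h - t\|^2$, so that $\delta_i = h_i - t_i$. The layer, the loss and all names (matvec, weight_gradient, the toy values) are assumptions made for this example and are not taken from the network described in the text.

```python
def matvec(W, z):
    """Return the matrix-vector product Wz for a list-of-rows matrix W."""
    return [sum(w_ij * z_j for w_ij, z_j in zip(row, z)) for row in W]


def loss(W, z, t):
    """Squared-error loss 0.5 * ||Wz - t||^2 (an assumption of this sketch)."""
    h = matvec(W, z)
    return 0.5 * sum((h_i - t_i) ** 2 for h_i, t_i in zip(h, t))


def weight_gradient(W, z, t):
    """Gradient dL/dw_ij = delta_i * z_j, with delta_i = dL/dh_i = h_i - t_i."""
    h = matvec(W, z)
    delta = [h_i - t_i for h_i, t_i in zip(h, t)]
    return [[d_i * z_j for z_j in z] for d_i in delta]


def finite_difference(W, z, t, i, j, eps=1e-6):
    """Central finite-difference estimate of dL/dw_ij."""
    W_plus = [row[:] for row in W]
    W_minus = [row[:] for row in W]
    W_plus[i][j] += eps
    W_minus[i][j] -= eps
    return (loss(W_plus, z, t) - loss(W_minus, z, t)) / (2 * eps)


# Toy values, chosen only for illustration.
W = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2]]
z = [1.0, 2.0, -1.0]   # activations of the previous layer
t = [0.0, 1.0]         # target values
grad = weight_gradient(W, z, t)
```

Each entry grad[i][j] should agree with finite_difference(W, z, t, i, j) up to rounding, since the gradient is just the outer product of the $\delta$ vector with the previous layer's activations.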

We denote an input pair by (x, y), where x is an item and y is an item in the context of x. Creating the pairs is straightforward, apart from the corner cases at the start and end of the sequence, and is illustrated in Algorithm 1. At the beginning and end of the sequence, we “wrap” around the sequence. That is, for an entry with index less than the window size, the left part of the window is smaller than the right part; we fix this by enlarging the left part of the context to include the last entries of the sequence (wrap-around). This is done to avoid overestimating (and hence sampling too many) pairs at the beginning of the sequence. Note that this introduces synthetic (false) pairings, but given the average length of our sequences and a window size of 10, we deem this to be negligible. The same is done for the end of the sequence. We recently realized that a preferable method would be to decrease the sampling frequency of entries at the beginning and end of the sequence.

Skip-Gram was originally introduced with the creation of several pairs for each input during a single pass through the sequence, but this is nearly equivalent to passing through the dataset several times (and thus obtaining several pairs for each input). We perform the latter. This small change does allow the same output word to be re-sampled, which may put slightly more emphasis on frequent word pairings.

Following the discussion in Section 5.2, we create a label encoding for each item, i.e. a 1-1 mapping to a unique index. Next, we one-hot encode each index to create a 1-1 mapping to a unit vector in $\mathbb{R}^{|\mathcal{I}|}$. For the practical map composition of Skip-Gram, we understand $x$ as an input vector, which is simply the label encoding combined with the one-hot encoding of the event $i$, resulting in a unit vector $e_x$ in $\mathbb{R}^{|\mathcal{I}|}$, and similarly for $y$.
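The two encoding steps can be sketched as follows. This is a minimal illustration, assuming that indices are assigned in order of first appearance; the function names and the toy sequence are hypothetical.

```python
def label_encode(items):
    """Map each distinct item to a unique index (order of first appearance)."""
    index_of = {}
    for item in items:
        if item not in index_of:
            index_of[item] = len(index_of)
    return index_of


def one_hot(index, size):
    """Return the unit vector of length `size` with a 1 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


sequence = ["a", "b", "a", "c"]               # hypothetical items
index_of = label_encode(sequence)             # label encoding
e_x = one_hot(index_of["b"], len(index_of))   # unit vector for item "b"
```

In practice the one-hot vector is rarely materialized, because multiplying a matrix by a unit vector only selects one column.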

Algorithm 1: SkipGram pairing generator(sequence)
Result: returnList, a list of pairs

  list sequence ;                       /* or a vector */
  int sequenceEnd = length(sequence);
  vector D = sampling distribution ;    /* a hyperparameter, a probability vector */
  int windowSize ;                      /* a hyperparameter */
  list returnList = list() ;            /* initialize empty list */

  for index in sequenceIndexes do
    if index ≤ windowSize then
      wrappedContext = context wrapped around the start of the sequence (see text);
      sampledEntry = sample from wrappedContext according to D;
    else if sequenceEnd - index < windowSize then
      wrappedContext = context wrapped around the end of the sequence (see text);
      sampledEntry = sample from wrappedContext according to D;
    else
      sampledEntry = sample an entry from the window around index according to D;
    end
    pair = (sequence[index], sampledEntry) ;    /* a vector or a tuple */
    returnList.append(pair);
  end

  return returnList ;                   /* a list of pairs */
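A runnable sketch of the pairing generator is given below. It is one possible reading of Algorithm 1, not the implementation used in this work: it uses 0-based indices, handles both wrap-around cases through modular indexing, and interprets the sampling distribution as weights over the window offsets. The function name, the parameter names and the uniform default weights are assumptions.

```python
import random


def skipgram_pairs(sequence, window_size, offset_weights=None, rng=None):
    """Return one (item, context item) pair per position of the sequence."""
    rng = rng or random.Random()
    n = len(sequence)
    if n <= 2 * window_size:
        raise ValueError("sequence must be longer than the full window")
    # Context offsets -window_size..-1 and 1..window_size (0 is the item itself).
    offsets = [o for o in range(-window_size, window_size + 1) if o != 0]
    if offset_weights is None:
        offset_weights = [1.0] * len(offsets)  # uniform sampling, an assumption
    pairs = []
    for index in range(n):
        offset = rng.choices(offsets, weights=offset_weights, k=1)[0]
        # Modular indexing wraps the window around both ends of the sequence.
        context_index = (index + offset) % n
        pairs.append((sequence[index], sequence[context_index]))
    return pairs


sequence = list("abcdefgh")  # hypothetical toy sequence
pairs = skipgram_pairs(sequence, window_size=2, rng=random.Random(0))
```

Calling the generator once per pass over the data gives several pairs per input over several passes, which matches the multi-pass strategy described above.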

5.5.2 Skip-Gram mapping

This section studies Skip-Gram as a composition of mappings. Skip-Gram is merely a composition of mappings with an input $x$ and a true output class $y$. Here, $(x, y)$ is a pair from the pairing strategy in Section 5.5.1, and both are high-dimensional unit vectors in $\mathbb{R}^{|\mathcal{I}|}$. The definition of Skip-Gram is simple and contains only a few mappings. These are given as

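$$
\begin{aligned}
f_1 &: \mathbb{R}^{|\mathcal{I}|} \to \mathbb{R}^{D}, & f_1(x) &= Wx, \\
f_2 &: \mathbb{R}^{D} \to \mathbb{R}^{|\mathcal{I}|}, & f_2(z) &= W'z, \\
f_3 &: \mathbb{R}^{|\mathcal{I}|} \to \mathbb{R}^{|\mathcal{I}|}, & f_3(z) &= \operatorname{softmax}(z),
\end{aligned}
$$

with the composition

$$x \mapsto \hat{y}(x) = f_3(f_2(f_1(x))) = \operatorname{softmax}(W'Wx),$$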

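where $W$ is a $D \times |\mathcal{I}|$-dimensional matrix of weights (or parameters), $W'$ is an $|\mathcal{I}| \times D$-dimensional matrix of weights (or parameters), $D$ denotes the embedding dimension (a hyperparameter, see Section 5.5.4) and finally softmax denotes the function

$$\operatorname{softmax} : \mathbb{R}^{|\mathcal{I}|} \to \mathbb{R}^{|\mathcal{I}|}, \qquad \operatorname{softmax}(z)_j = \frac{\exp(z_j)}{\sum_k \exp(z_k)}, \quad j = 1, \dots, |\mathcal{I}|.$$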

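Further simplifications can be made to the Skip-Gram mapping, since $x$ and $y$ are unit vectors. Indeed, suppose that $x = e_I$ and $y = e_O$, where $I$ stands for input and $O$ for output. Let $w_i$ denote the $i$th column of $W$ and let $w'_i$ denote the $i$th row of $W'$. Since $x$ is a unit vector, the function $f_1$ yields the $I$th column of $W$, as can be seen from the following computation

$$
W'Wx =
\begin{pmatrix} \text{---}\; w'_1 \;\text{---} \\ \vdots \\ \text{---}\; w'_{|\mathcal{I}|} \;\text{---} \end{pmatrix}
\begin{pmatrix} | & | & & | \\ w_1 & w_2 & \cdots & w_{|\mathcal{I}|} \\ | & | & & | \end{pmatrix}
\begin{pmatrix} 0 \\ \vdots \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}
=
\begin{pmatrix} \text{---}\; w'_1 \;\text{---} \\ \vdots \\ \text{---}\; w'_{|\mathcal{I}|} \;\text{---} \end{pmatrix}
\begin{pmatrix} | \\ w_I \\ | \end{pmatrix}
=
\begin{pmatrix} w'_1 w_I \\ w'_2 w_I \\ \vdots \\ w'_{|\mathcal{I}|} w_I \end{pmatrix},
$$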

where the bars indicate the direction in which each vector extends (i.e. row or column vector). Notice that the final matrix is simply a vector of inner products between $w'_j$ and $w_I$. Thus the $j$th entry of $\hat{y}(x)$ is given by

$$\hat{y}_j(x) = \frac{\exp(w'_j w_I)}{\sum_k \exp(w'_k w_I)}, \quad j \in \{1, \dots, |\mathcal{I}|\}. \tag{5.10}$$

This is interpreted as an estimate of the probability that $y = e_j$ is the context word, given the input word $x$, i.e.

$$P(y \mid x) \leftarrow \hat{y}_j(x).$$
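The simplification can be checked numerically. The sketch below computes the Skip-Gram output once through the full matrix products and once through the inner products of equation (5.10), and the two should agree. The helper names and the toy matrices (embedding dimension 2, three items) are assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def softmax(z):
    m = max(z)  # subtracting the maximum leaves the result unchanged
    exps = [math.exp(value - m) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


def skipgram_forward(W, W_out, x):
    """Full composition softmax(W' W x) for a one-hot input vector x."""
    return softmax(matvec(W_out, matvec(W, x)))


def skipgram_forward_index(W, W_out, I):
    """Equation (5.10): softmax of the inner products w'_j w_I."""
    w_I = [row[I] for row in W]  # the I-th column of W
    return softmax([dot(w_out_j, w_I) for w_out_j in W_out])


# Toy weights: W is D x |I| = 2 x 3, W_out (playing the role of W') is 3 x 2.
W = [[0.1, 0.4, -0.3], [0.2, -0.5, 0.6]]
W_out = [[0.3, 0.1], [-0.2, 0.4], [0.5, -0.1]]
I = 1
x = [0.0, 1.0, 0.0]  # the unit vector e_I

y_full = skipgram_forward(W, W_out, x)
y_index = skipgram_forward_index(W, W_out, I)
```

The index-based form avoids the multiplication by a one-hot vector, which is why embedding lookups are usually implemented as a column (or row) selection.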

To measure the error of the mapping, we use the cross-entropy loss function, given by

$$L(y, p) = -\sum_{j=1}^{|\mathcal{I}|} y_j \log p_j$$

for a probability vector $y$ (the true distribution) and another probability vector $p$ (the candidate or estimated distribution), where we interpret each probability vector as a distribution. Since the vector $y$ is simply $e_O$, this reduces to

$$L(y, \hat{y}(x)) = -\sum_{j=1}^{|\mathcal{I}|} y_j \log \hat{y}_j(x) = -\log \frac{\exp(w'_O w_I)}{\sum_k \exp(w'_k w_I)} = \log \sum_k \exp(w'_k w_I) - w'_O w_I. \tag{5.11}$$
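The reduction in equation (5.11) can be verified on a small example: the general cross-entropy with a one-hot true distribution equals the log-sum-exp of the scores minus the score of the true class. The function names and the score values below are hypothetical.

```python
import math


def cross_entropy(y, p):
    """General cross-entropy -sum_j y_j log p_j (terms with y_j = 0 vanish)."""
    return -sum(y_j * math.log(p_j) for y_j, p_j in zip(y, p) if y_j > 0)


def loss_from_scores(scores, O):
    """Reduced form log(sum_k exp(s_k)) - s_O, evaluated in a stable way."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores)) - scores[O]


scores = [0.5, -1.0, 2.0]   # hypothetical inner products w'_k w_I
O = 2                       # index of the true output item
total = sum(math.exp(s) for s in scores)
y_hat = [math.exp(s) / total for s in scores]
y = [0.0, 0.0, 1.0]         # the unit vector e_O

general = cross_entropy(y, y_hat)
reduced = loss_from_scores(scores, O)
```

Working with the reduced form avoids computing the full probability vector only to take the logarithm of a single entry.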

Using this loss function, we obtain the following total cost function

$$C = \frac{1}{N} \sum_{j=1}^{N} L(y_j, \hat{y}(x_j)) = \frac{1}{N} \sum_{j=1}^{N} \left( \log \sum_k \exp\left(w'_k [w_I]_j\right) - [w'_O]_j [w_I]_j \right), \tag{5.12}$$

where the sum runs over the $N$ training samples, $[w_I]_j$ denotes the input vector for the true input class of the $j$th sample ($x_j = e_I$), and similarly $[w'_O]_j$ for $y_j = e_O$. It is common to update the weights after each sample (or each batch of samples) instead of using the total cost function above. We adopt this approach as well. This modification of the update strategy is mentioned in [22] as being effective at avoiding getting stuck in local minima. It is sometimes referred to as stochastic gradient descent. The only remaining step is to optimize the cost function, which is described in Section 5.5.3.

5.5.3 Skip-Gram gradient descent

In this section we describe the procedure for updating the weights in Skip-Gram, i.e. the matrices $W$ and $W'$. This is done by minimizing the cost function in equation (5.12). Note that the cost function is later modified by Negative Sampling in Section 5.5.5. Let $z$ denote an entry (or weight/parameter) of $W$ or $W'$ (for a general mapping, any trainable (free) weight/parameter in the mapping). The goal is to compute

$$\frac{\partial L(y, \hat{y}(x))}{\partial z},$$

with $\hat{y}$ defined in equation (5.10) and $y$ the one-hot encoded vector of the true label. We proceed to update $z$ at iteration $t$ of the training procedure according to the equation

$$z^{(t)} = z^{(t-1)} - \alpha \frac{\partial L(y, \hat{y}(x))}{\partial z},$$

where $z^{(t)}$ denotes the value of $z$ at the start of iteration $t$ and $\alpha$ denotes a learning rate, see Section 5.5.4 and subsection 5.4. Note that we must first compute all such derivatives and then update all $z$ simultaneously.

Let $z = (w_i)_j$ denote the $j$th entry in the $i$th column of $W$ (also commonly known as $W_{ji}$) and consider the derivative of the following mapping from equation (5.11)

$$(w_i)_j \mapsto L(y, \hat{y}(x)) = \log \sum_k \exp(w'_k w_I) - w'_O w_I,$$

for a fixed pair of input/output vectors $x = e_I$ and $y = e_O$ from the sampling methodology in Section 5.5.1, where $y_O$ and $\hat{y}_O(x)$ denote the $O$th entries of the probability vectors. We distinguish the cases $i = I$ and $i \ne I$, and obtain

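$$
\frac{\partial L}{\partial (w_i)_j} =
\begin{cases}
\dfrac{\sum_k (w'_k)_j \exp(w'_k w_I)}{\sum_m \exp(w'_m w_I)} - (w'_O)_j, & \text{if } i = I, \\[2ex]
0, & \text{if } i \ne I.
\end{cases}
$$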

Observe that this only updates $w_I$, the $I$th column of $W$, corresponding to the input item. Written as a vector, the above gradient is simply

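$$
\begin{aligned}
\frac{\partial L}{\partial w_I}
&= \sum_k \left[ \frac{\exp(w'_k w_I)}{\sum_m \exp(w'_m w_I)} \right] w'_k - w'_O \\
&= \underbrace{\left( \frac{\exp(w'_O w_I)}{\sum_m \exp(w'_m w_I)} - 1 \right)}_{\in (-1, 0)} w'_O + \sum_{k \ne O} \left[ \frac{\exp(w'_k w_I)}{\sum_m \exp(w'_m w_I)} \right] w'_k, \\
\frac{\partial L}{\partial w_i} &= 0 \quad \text{for } i \ne I,
\end{aligned}
\tag{5.13}
$$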

which results in the update

$$w_I^{(t)} = w_I^{(t-1)} - \alpha \left( \sum_{k \ne O} \frac{\exp(w'_k w_I)}{\sum_m \exp(w'_m w_I)} w'_k + \left( \frac{\exp(w'_O w_I)}{\sum_m \exp(w'_m w_I)} - 1 \right) w'_O \right). \tag{5.14}$$

Note that this update adds a bit of $w'_O$ to $w_I$ while subtracting a bit of each $w'_k$ with $k \ne O$. The effect is that the inner product $\langle w'_O, w_I \rangle$ is increased while it is decreased for $w'_k$, $k \ne O$. For the probability estimate $\hat{y}$, this results in an increased $\hat{y}_O$ and a decreased $\hat{y}_k$ for $k \ne O$. This makes intuitive sense: we observe the pair $(x, y)$, and the update increases the estimated probability of observing that pair.
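The update of the input vector can be illustrated with a small sketch. It implements the gradient $\sum_k \hat{y}_k w'_k - w'_O$ of the loss in equation (5.11) with respect to $w_I$, takes one gradient step, and compares the estimated probability of the observed output item before and after. All names, the toy weights and the learning rate are assumptions for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def probs(W_out, w_I):
    """Softmax over the inner products w'_k w_I (rows of W_out against w_I)."""
    scores = [dot(w_out_k, w_I) for w_out_k in W_out]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def loss(W_out, w_I, O):
    """Cross-entropy loss for the observed output item O."""
    return -math.log(probs(W_out, w_I)[O])


def grad_w_I(W_out, w_I, O):
    """Gradient with respect to w_I: sum_k y_hat_k * w'_k - w'_O."""
    p = probs(W_out, w_I)
    grad = [0.0] * len(w_I)
    for k, w_out_k in enumerate(W_out):
        coeff = p[k] - (1.0 if k == O else 0.0)
        for j, value in enumerate(w_out_k):
            grad[j] += coeff * value
    return grad


def sgd_step_w_I(W_out, w_I, O, alpha):
    """One gradient-descent step on the input vector w_I."""
    grad = grad_w_I(W_out, w_I, O)
    return [w - alpha * g for w, g in zip(w_I, grad)]


W_out = [[0.3, 0.1], [-0.2, 0.4], [0.5, -0.1]]  # toy W', rows are w'_k
w_I = [0.4, -0.5]                               # toy input vector
O = 0                                           # observed output item
alpha = 0.1                                     # hypothetical learning rate

before = probs(W_out, w_I)[O]
w_I_new = sgd_step_w_I(W_out, w_I, O, alpha)
after = probs(W_out, w_I_new)[O]
```

For a sufficiently small learning rate the value of after exceeds before, in line with the interpretation given above.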

Similarly, let $z = (w'_i)_j$ (also known as $W'_{ij}$). We consider the mapping

$$(w'_i)_j \mapsto L(y, \hat{y}(x)) = -y_O \log \hat{y}_O(x) = \log \sum_k \exp(w'_k w_I) - w'_O w_I,$$

and wish to compute its derivative with respect to $(w'_i)_j$. Once again, we distinguish the cases $i = O$ and $i \ne O$, and obtain

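$$\frac{\partial L}{\partial (w'_i)_j} = (w_I)_j \left( \frac{\exp(w'_i w_I)}{\sum_k \exp(w'_k w_I)} - \mathbb{1}_{\{i = O\}} \right).$$

This results in the following parameter update

$$(w'_i)_j^{(t)} = (w'_i)_j^{(t-1)} - \alpha \, (w_I)_j \left( \frac{\exp(w'_i w_I)}{\sum_k \exp(w'_k w_I)} - \mathbb{1}_{\{i = O\}} \right).$$

Written as a vector, this corresponds to

$${w'_i}^{(t)} = {w'_i}^{(t-1)} - \alpha \, w_I \left( \frac{\exp(w'_i w_I)}{\sum_k \exp(w'_k w_I)} - \mathbb{1}_{\{i = O\}} \right). \tag{5.15}$$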

Observe that this always updates all of $W'$. The update adds a bit of $w_I$ to $w'_O$ and subtracts a bit of $w_I$ from each $w'_k$ with $k \ne O$. This increases the inner product $\langle w_I, w'_O \rangle$ while decreasing $\langle w_I, w'_k \rangle$ for $k \ne O$. Thus the estimated probability $\hat{y}_O$ is once again increased and $\hat{y}_k$, $k \ne O$, is decreased.
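The effect on the inner products can be checked directly. In the sketch below, each row $w'_i$ moves by $-\alpha(\hat{y}_i - \mathbb{1}_{\{i=O\}}) w_I$, so the inner product with $w_I$ changes by $-\alpha(\hat{y}_i - \mathbb{1}_{\{i=O\}}) \|w_I\|^2$: it grows for $i = O$ and shrinks for every other row. The names and toy values are assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def probs(W_out, w_I):
    """Softmax over the inner products w'_k w_I (rows of W_out against w_I)."""
    scores = [dot(w_out_k, w_I) for w_out_k in W_out]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sgd_step_W_out(W_out, w_I, O, alpha):
    """One gradient-descent step on every row of W' for the pair (I, O)."""
    p = probs(W_out, w_I)
    new_W_out = []
    for i, w_out_i in enumerate(W_out):
        coeff = p[i] - (1.0 if i == O else 0.0)
        new_W_out.append([w - alpha * coeff * x for w, x in zip(w_out_i, w_I)])
    return new_W_out


W_out = [[0.3, 0.1], [-0.2, 0.4], [0.5, -0.1]]  # toy W', rows are w'_i
w_I = [0.4, -0.5]                               # toy input vector
O = 0                                           # observed output item

scores_before = [dot(row, w_I) for row in W_out]
W_out_new = sgd_step_W_out(W_out, w_I, O, alpha=0.1)
scores_after = [dot(row, w_I) for row in W_out_new]
```

Unlike the update of $W$, which touches a single column, this step modifies every row of $W'$, which is the cost that Negative Sampling is designed to reduce.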

5.5.4 Skip-Gram hyper-parameters

Skip-Gram contains a multitude of hyper-parameters, all of which affect the cost function. Hence, to obtain a good result, it is important to study each of them and how they affect both the cost function and the interpretation of the model. Overall, we have the following hyper-parameters, with their effects described briefly in parentheses:

1. Embedding dimension (affects the dimension of the embeddings).

2. Negative sampling (from [15], both if used and the amount of negative samples).

3. Window size for the context (affects the possible set of pairings).

4. Window sampling distribution (affects the frequency of pairings).

5. Cut-off level (remove rare entries and homogenize, but at the cost of throwing away/masking rare events).

6. Cut-off technique (replace the entry by a standard token in the sequence or completely remove the entry from the sequence).

7. Noise distribution for Negative Sampling (affects the frequency with which items are used as negative samples).

Furthermore, initialization of weight matrices is not a hyper-parameter that affects the obtained representations but it may nonetheless be useful for faster convergence of the algorithm.
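The two cut-off techniques in items 5 and 6 can be sketched as follows. This is a minimal illustration, assuming that an item counts as rare when it occurs fewer than min_count times across all sequences; the function name, the parameter names and the token "<RARE>" are hypothetical.

```python
from collections import Counter


def apply_cutoff(sequences, min_count, technique="replace", token="<RARE>"):
    """Mask or drop items that occur fewer than `min_count` times overall."""
    counts = Counter(item for seq in sequences for item in seq)
    result = []
    for seq in sequences:
        if technique == "replace":
            # Keep the sequence length, but mask rare items by a standard token.
            result.append([item if counts[item] >= min_count else token
                           for item in seq])
        elif technique == "remove":
            # Drop rare items entirely, which shortens the sequence.
            result.append([item for item in seq if counts[item] >= min_count])
        else:
            raise ValueError("technique must be 'replace' or 'remove'")
    return result


sequences = [["a", "b", "a"], ["a", "c"]]  # hypothetical toy data
replaced = apply_cutoff(sequences, min_count=2, technique="replace")
removed = apply_cutoff(sequences, min_count=2, technique="remove")
```

Replacing keeps the positions of the remaining items intact, whereas removing brings items closer together and therefore changes which pairs fall inside a context window.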

5.5.5 Negative Sampling

Negative Sampling was introduced in [15] as a computational optimization for Skip-Gram. The paper introduces Negative Sampling as the procedure performed by optimizing the following cost function,

$$\log \sigma(w'_O w_I) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-w'_i w_I) \right], \tag{5.16}$$

where $\sigma(x) = 1/(1 + \exp(-x))$, $k$ denotes the number of negative samples, $P_n(w)$ denotes a noise distribution on all items and $w'_i$ is the vector representation of an item $i$ sampled from the noise distribution $P_n$. The goal of equation (5.16) is to train the model to distinguish between the actual output and the noise distribution, as described in the original methodology paper [7] on noise-contrastive estimation, of which Negative Sampling is a special case. It describes the training task as learning to distinguish between the target word (item) $w_O$ and the noise samples $(w_i)_{1 \le i \le k}$. The paper [6] describes this change in the cost function in further detail and observes that Negative Sampling modifies the original training objective in equation (5.11) by introducing a different objective than the original Skip-Gram.
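A sketch of how the objective in equation (5.16) may be evaluated for a single training pair is given below. It takes the logarithm of the sigmoid in both terms, as in the formulation of [15], and assumes that each expectation is approximated by a single draw from the noise distribution, so the function receives the vectors of $k$ already-sampled negative items. The function names and toy vectors are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def negative_sampling_objective(w_I, w_out_O, negative_vectors):
    """log sigma(w'_O w_I) + sum_i log sigma(-w'_i w_I) for sampled negatives."""
    value = log_sigmoid(dot(w_out_O, w_I))
    for w_out_i in negative_vectors:
        value += log_sigmoid(-dot(w_out_i, w_I))
    return value


w_I = [0.4, -0.5]                      # toy input vector
w_out_O = [0.3, 0.1]                   # toy vector of the observed output item
negatives = [[-0.2, 0.4], [0.5, -0.1]]  # toy vectors of k = 2 negative samples
objective = negative_sampling_objective(w_I, w_out_O, negatives)
```

Only the $k + 1$ vectors involved are touched per pair, instead of all $|\mathcal{I}|$ rows of $W'$ required by the softmax.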

Define the set $D$ of all item and context pairs $(w, w_c)$ that occur in the text by

$$D := \{(w, w_c) \mid (w, w_c) \text{ is a possible pair from the pairing generator}\}.$$

Another way of formulating equation (5.16) is to introduce a random variable $Z$ which indicates whether the pair $(w, w_c)$ is the true output ($Z = 1$) or a sample from the noise distribution ($Z = 0$). In other words, we seek the optimal set of parameters $\theta = (W, W')$ such that the following joint probability is maximized

$$\arg\max_{\theta} \; P(Z = 1 \mid w, w_c) \prod_{i=1}^{k} P(Z = 0 \mid w, w_i),$$

which for all samples results in optimizing

$$\arg\max_{\theta} \prod_{(w, w_c) \in D} \left( P(Z = 1 \mid w, w_c) \prod_{i=1}^{k} P(Z = 0 \mid w, w_i) \right).$$

Optimizing this is the same as optimizing its logarithm, which results in

$$\arg\max_{\theta} \sum_{(w, w_c) \in D} \left( \log P(Z = 1 \mid w, w_c) + \sum_{i=1}^{k} \log P(Z = 0 \mid w, w_i) \right).$$

The negative samples $(w_i)_{1 \le i \le k}$ are drawn according to a noise distribution $P_n$. Most commonly, this distribution utilizes the frequency of item $i$, denoted $f(i)$ and defined as

$$f(i) = \frac{\operatorname{count}(i)}{\sum_{j=1}^{p} n_j},$$

where $n_j$ denotes the length of sequence $s_j$, and $\operatorname{count}(i)$ denotes the number of times item $i$ occurs across all sequences. Note that this is a distribution across $i$ and thus could potentially be used as a noise distribution. The authors of [15] suggest a modified version of this, defined as

$$P_n(i) = \frac{f(i)^{3/4}}{\sum_{j=1}^{|\mathcal{I}|} f(j)^{3/4}},$$

and thus sample the $k$ negative samples according to $P_n$. In our case, we utilize the observation from natural language processing that the occurrence frequency of words approximately follows a log-uniform distribution (see [19] for some description and history of the name), given by

$$f(i) \sim \exp(U(0, \log(|\mathcal{I}|))),$$

where $U(0, \log(|\mathcal{I}|))$ denotes the uniform distribution on $[0, \log(|\mathcal{I}|)]$. We draw $k$ negative samples using this noise distribution for equation (5.16) above. We did not test different choices of noise distribution, although recent experimental evidence [2] suggests that other choices of noise distribution may be preferable for non-natural language processing tasks.
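Both noise distributions discussed above can be sketched in a few lines. The first part computes the item frequencies $f(i)$ and the $3/4$-power distribution suggested in [15] and draws negative samples from it. The second part draws a value $\exp(U(0, \log n))$, which lies in $[1, n]$. How such a draw is mapped to a concrete item is not specified here and is left out of the sketch. All function names and the toy data are assumptions.

```python
import math
import random
from collections import Counter


def frequency(sequences):
    """f(i) = count(i) divided by the total length of all sequences."""
    counts = Counter(item for seq in sequences for item in seq)
    total = sum(len(seq) for seq in sequences)
    return {item: c / total for item, c in counts.items()}


def noise_distribution(freq, power=0.75):
    """P_n(i) proportional to f(i)^(3/4), normalized to sum to one."""
    weights = {item: f ** power for item, f in freq.items()}
    norm = sum(weights.values())
    return {item: w / norm for item, w in weights.items()}


def sample_negatives(noise, k, rng):
    """Draw k items (with replacement) according to the noise distribution."""
    items = list(noise)
    return rng.choices(items, weights=[noise[i] for i in items], k=k)


def sample_log_uniform(n, rng):
    """Draw exp(U(0, log n)), a log-uniform value between 1 and n."""
    return math.exp(rng.uniform(0.0, math.log(n)))


rng = random.Random(0)
sequences = [["a", "a", "a", "b"], ["a", "c", "c", "c"]]  # hypothetical data
freq = frequency(sequences)
noise = noise_distribution(freq)
negatives = sample_negatives(noise, k=5, rng=rng)
draws = [sample_log_uniform(1000, rng) for _ in range(10000)]
```

The $3/4$ power flattens the frequency distribution, so rare items are sampled more often than their raw frequency suggests, while about half of the log-uniform draws fall below $\sqrt{n}$, which concentrates mass on small values.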
