This relation holds for any given $\delta^{[s]}_j$ with $1 \le s < K$ by iteration. The procedure of calculating each $\delta$ from the previously calculated $\delta$'s is called back-propagation.

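The take-away is that we may compute $\delta^{[s]}_j$ from the previously calculated $\delta$'s (by iterating this procedure backwards) and obtain the desired derivatives for gradient descent through the formula

$$\frac{\partial \mathcal{L}}{\partial w^{[s]}_{ij}} = \frac{\partial \mathcal{L}}{\partial h^{[s]}_i}\,\frac{\partial h^{[s]}_i}{\partial w^{[s]}_{ij}} = \delta^{[s]}_i\, z^{[s-1]}_j,$$

where $w^{[s]}_{ij}$ denotes the $(i, j)$'th entry of $W^{[s]}$ in the $s$th layer.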
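As a minimal sketch of this formula, the gradient with respect to all weights of one layer is the outer product of the $\delta$ vector of that layer and the activations of the previous layer. The function name and the numerical values below are ours and purely illustrative.

```python
def weight_gradient(delta, z_prev):
    """Return the matrix with entries dL/dw_ij = delta[i] * z_prev[j]."""
    return [[d_i * z_j for z_j in z_prev] for d_i in delta]


# Hypothetical values: two units in layer s, three units in layer s-1.
delta_s = [0.5, -1.0]        # delta^{[s]}
z_prev = [1.0, 2.0, 3.0]     # z^{[s-1]}
grad_W = weight_gradient(delta_s, z_prev)
```

Row $i$ of `grad_W` holds the derivatives with respect to the weights feeding unit $i$ of layer $s$.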

We write a pair of input items as $(x, y)$, as it corresponds to an item $x$ and an item $y$ in the context of $x$. Creating the pairs is straightforward, apart from the corner cases at the start and end of the sequence, and is illustrated in Algorithm 1. For the beginning and end of the sequence, we “wrap” around the sequence. That is, for an entry with index less than the window size, the left part of the window is smaller than the right part; we fix this by enlarging the left part of the context to include the last entries of the sequence (wrap-around). This is done to avoid overestimating (and hence sampling too many) pairs at the beginning of the sequence. Note that this introduces synthetic/false pairings, but given the average length of our sequences and a window size of 10, we deem this to be negligible. The same is done for the end of the sequence. We recently realized that a preferable method would be to decrease the sampling frequency of entries at the beginning and end of the sequence.

Skip-Gram was originally introduced with the creation of several pairs for each input and a single pass through the sequence, but this is nearly equivalent to passing through the dataset several times (and thus obtaining several pairs for each input). We perform the latter. This small change does allow for re-sampling of the same output word, which may put a bit more emphasis on frequent word pairings.

Following the discussion in Section 5.2, we first label-encode each item, i.e. we create a 1-1 mapping from items to unique indices. Next, we one-hot encode each index, creating a 1-1 mapping to a unit vector in $\mathbb{R}^{|I|}$. For the practical map composition of Skip-Gram, we understand $x$ as an input vector, which is simply the label encoding combined with the one-hot encoding of the event $i$, resulting in a unit vector $e_x \in \mathbb{R}^{|I|}$, and similarly for $y$.
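The two encoding steps can be sketched as follows. This is an illustration only: the function names and the example events are hypothetical, and indices are assigned in order of first appearance, which is one of many valid 1-1 mappings.

```python
def label_encode(items):
    """Map each distinct item to a unique index, in order of first appearance."""
    index_of = {}
    for item in items:
        if item not in index_of:
            index_of[item] = len(index_of)
    return index_of


def one_hot(index, size):
    """Return the unit vector e_index in R^size as a list of floats."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


sequence = ["login", "search", "login", "purchase"]   # hypothetical events
index_of = label_encode(sequence)
e_x = one_hot(index_of["search"], len(index_of))
```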

Algorithm 1: SkipGram pairing generator(sequence)
Result: a list of pairs
  list sequence;                        /* or a vector */
  int sequenceEnd = length(sequence);
  vector D = sampling distribution;     /* a hyperparameter, a probability vector */
  int windowSize;                       /* a hyperparameter */
  list returnList = list();             /* initialize empty list */
  for index in sequenceIndexes do
    if index < windowSize then
      wrappedContext = context wrapped around the start of the sequence (see text);
      sampledEntry = sample an entry from wrappedContext according to D;
    else if sequenceEnd - index < windowSize then
      wrappedContext = context wrapped around the end of the sequence (see text);
      sampledEntry = sample an entry from wrappedContext according to D;
    else
      sampledEntry = sample an entry from the window around index according to D;
    end
    pair = (sequence[index], sampledEntry);   /* a vector or a tuple */
    returnList.append(pair);
  end
  return returnList;                    /* a list of pairs */
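A possible Python rendering of Algorithm 1 is sketched below. It is not the implementation used for the experiments. We assume that the sampling distribution is given as weights over the window offsets $-\text{windowSize}, \dots, -1, 1, \dots, \text{windowSize}$ (uniform if omitted), and that the sequence is longer than the window size. The wrap-around at both ends is implemented with modular indexing, so the three branches of the algorithm collapse into one line.

```python
import random


def skipgram_pairs(sequence, window_size, offset_weights=None, rng=None):
    """Return one (input, context) pair per position, wrapping around the ends."""
    rng = rng or random.Random()
    n = len(sequence)
    # Offsets -window_size..-1 and 1..window_size (the item itself is excluded).
    offsets = [o for o in range(-window_size, window_size + 1) if o != 0]
    if offset_weights is None:
        offset_weights = [1.0] * len(offsets)   # assumption: uniform D
    pairs = []
    for index in range(n):
        offset = rng.choices(offsets, weights=offset_weights, k=1)[0]
        # Modular indexing implements the wrap-around at both ends.
        context = sequence[(index + offset) % n]
        pairs.append((sequence[index], context))
    return pairs


pairs = skipgram_pairs(list("abcdefgh"), window_size=2, rng=random.Random(0))
```

Calling the function once per epoch gives one pair per input and pass, matching the multi-pass strategy described above.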

5.5.2 Skip-Gram mapping

This section studies the Skip-Gram composition of mappings. Skip-Gram is merely a composition of mappings with an input $x$ and a true output class $y$. Here, $x$ and $y$ form a pair from the pairing strategy in Section 5.5.1, and both are high-dimensional unit vectors in $\mathbb{R}^{|I|}$. The definition of Skip-Gram is simple and contains only a few mappings. These are given as

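$$\begin{aligned}
f_1 &: \mathbb{R}^{|I|} \to \mathbb{R}^{D}, & f_1(x) &= Wx, \\
f_2 &: \mathbb{R}^{D} \to \mathbb{R}^{|I|}, & f_2(z) &= W'z, \\
f_3 &: \mathbb{R}^{|I|} \to \mathbb{R}^{|I|}, & f_3(z) &= \operatorname{softmax}(z),
\end{aligned}$$

with the composition

$$x \mapsto \hat{y}(x) = f_3(f_2(f_1(x))) = \operatorname{softmax}(W'Wx),$$

where $W$ is a $D \times |I|$-dimensional matrix of weights (or parameters), $W'$ is an $|I| \times D$-dimensional matrix of weights (or parameters), $D$ denotes the embedding dimension (a hyperparameter, see Section 5.5.4) and finally softmax denotes the function

$$\operatorname{softmax} : \mathbb{R}^{|I|} \to \mathbb{R}^{|I|}, \qquad \operatorname{softmax}(z) = \left( \frac{\exp(z_j)}{\sum_k \exp(z_k)} \right)_{j = 1, \dots, |I|}.$$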
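The composition can be sketched in a few lines of plain Python. The weights below are hypothetical toy values with $|I| = 3$ and $D = 2$. We store $W$ as $D$ rows of length $|I|$ so that $Wx$ maps $\mathbb{R}^{|I|}$ to $\mathbb{R}^{D}$. Subtracting the maximum score inside the softmax is a standard numerical safeguard that does not change the result.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(z):
    """Softmax of a list of scores; subtracting max(z) avoids overflow."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def skipgram_forward(W, W_prime, x):
    """Compute y_hat(x) = softmax(W' W x), i.e. f3(f2(f1(x)))."""
    return softmax(matvec(W_prime, matvec(W, x)))


W = [[0.1, 0.4, -0.2],         # maps R^3 to R^2
     [0.3, -0.1, 0.5]]
W_prime = [[0.2, 0.1],         # maps R^2 to R^3
           [-0.3, 0.4],
           [0.0, -0.5]]
y_hat = skipgram_forward(W, W_prime, [0.0, 1.0, 0.0])
```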

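Further simplifications can be made on the Skip-Gram mapping, since $x$ and $y$ are unit vectors. Indeed, suppose that $x = e_I$ and $y = e_O$, where $I$ stands for the input and $O$ for the output. Let $w_i$ denote the $i$th column of $W$ and let $w'_i$ denote the $i$th row of $W'$. Since $x$ is a unit vector, the function $f_1$ yields the $I$th column of $W$, as can be seen by the following computation

$$
\begin{pmatrix}
\rule[.5ex]{1.5em}{0.4pt} & w'_1 & \rule[.5ex]{1.5em}{0.4pt} \\
 & \vdots & \\
\rule[.5ex]{1.5em}{0.4pt} & w'_{|I|} & \rule[.5ex]{1.5em}{0.4pt}
\end{pmatrix}
\begin{pmatrix}
\vert & \vert & & \vert \\
w_1 & w_2 & \cdots & w_{|I|} \\
\vert & \vert & & \vert
\end{pmatrix}
\begin{pmatrix} 0 \\ \vdots \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}
=
\begin{pmatrix}
\rule[.5ex]{1.5em}{0.4pt} & w'_1 & \rule[.5ex]{1.5em}{0.4pt} \\
 & \vdots & \\
\rule[.5ex]{1.5em}{0.4pt} & w'_{|I|} & \rule[.5ex]{1.5em}{0.4pt}
\end{pmatrix}
\begin{pmatrix} \vert \\ w_I \\ \vert \end{pmatrix}
=
\begin{pmatrix} w'_1 w_I \\ w'_2 w_I \\ \vdots \\ w'_{|I|} w_I \end{pmatrix},
$$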

where the bars indicate the direction in which each vector extends (i.e. row or column vector). Notice that the final result is simply a vector of inner products between $w'_j$ and $w_I$. Thus the $j$th entry of $\hat{y}(x)$ is given by

$$\hat{y}_j(x) = \frac{\exp(w'_j w_I)}{\sum_k \exp(w'_k w_I)}, \qquad j \in \{1, \dots, |I|\}. \tag{5.10}$$

This is interpreted as an estimate of the probability that $y = e_j$ is the context word, given the input word $x$, i.e.

$$P(y \mid x) \leftarrow \hat{y}_j(x).$$
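The practical consequence of the computation above is that the product with a one-hot vector never has to be carried out: it is a column lookup. The following sketch checks this on hypothetical toy weights (the helper names are ours).

```python
def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def column(matrix, index):
    """Return column `index` of a matrix given as a list of rows."""
    return [row[index] for row in matrix]


W = [[0.1, 0.4, -0.2],     # hypothetical weights, 2 rows and 3 columns
     [0.3, -0.1, 0.5]]
I = 1                      # index of the input item
e_I = [0.0, 1.0, 0.0]      # its one-hot encoding

via_product = matvec(W, e_I)   # full matrix-vector product W e_I
via_lookup = column(W, I)      # plain lookup of the I-th column w_I
```

For large $|I|$ the lookup avoids a product whose cost grows with $|I| \cdot D$, which is why embeddings are usually stored and read as columns (or rows) directly.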

To measure the error of the mapping, we use the cross-entropy loss function, given by

$$\mathcal{L}(y, p) = -\sum_{j=1}^{|I|} y_j \log p_j,$$

for a probability vector $y$ (the true distribution) and another probability vector $p$ (the candidate/estimated distribution), where we interpret each probability vector as a distribution. Since the vector $y$ is simply $e_O$, this reduces to

$$\mathcal{L}(y, \hat{y}(x)) = -\sum_{j=1}^{|I|} y_j \log \hat{y}_j(x) = -\log \frac{\exp(w'_O w_I)}{\sum_k \exp(w'_k w_I)} = \log\Big(\sum_k \exp(w'_k w_I)\Big) - w'_O w_I. \tag{5.11}$$
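The last form of equation (5.11) is also the convenient one numerically. The sketch below evaluates it with the usual log-sum-exp shift. The weights are hypothetical toy values and the function names are ours.

```python
import math


def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def skipgram_loss(W_prime, w_I, O):
    """Loss (5.11): log(sum_k exp(w'_k . w_I)) - w'_O . w_I."""
    scores = [dot(row, w_I) for row in W_prime]
    shift = max(scores)   # log-sum-exp trick: keeps exp() from overflowing
    log_sum_exp = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return log_sum_exp - scores[O]


W_prime = [[0.2, 0.1], [-0.3, 0.4], [0.0, -0.5]]   # rows w'_1, w'_2, w'_3
w_I = [0.4, -0.1]                                  # column of W for the input
loss = skipgram_loss(W_prime, w_I, O=0)
```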

Using this loss function, we obtain the following total cost function

$$C = \frac{1}{N} \sum_{j=1}^{N} \mathcal{L}(y_j, \hat{y}(x_j)) = \frac{1}{N} \sum_{j=1}^{N} \left[ \log\Big(\sum_k \exp\big(w'_k [w_I]_j\big)\Big) - [w'_O]_j [w_I]_j \right], \tag{5.12}$$

where the sum runs over the $N$ training samples, $[w_I]_j$ denotes the column of $W$ for the true input class of the $j$th sample ($x_j = e_I$), and similarly $[w'_O]_j$ for $y_j = e_O$. It is common to update the weights after each sample (or each batch of samples) instead of using the total cost function as above. We adopt this approach as well. This modification of the update strategy is mentioned in [22] as being effective at avoiding getting stuck in local minima. It is sometimes referred to as stochastic gradient descent. The only remaining step is to optimize the cost function, which is described in Section 5.5.3.

5.5.3 Skip-Gram gradient descent

In this section we describe the procedure for updating the weights in Skip-Gram, i.e. the matrices $W$ and $W'$. This is done by minimizing the cost function in equation (5.12). Note that the cost function is later modified by Negative Sampling in Section 5.5.5. Let $z$ denote an entry (or weight/parameter) of $W$ or $W'$ (for a general mapping, any trainable (free) weight/parameter of the mapping). The goal is to compute

$$\frac{\partial \mathcal{L}(y, \hat{y}(x))}{\partial z},$$

with $\hat{y}$ defined in equation (5.10) and $y$ the one-hot encoded vector of the true label. We proceed to update $z$ at iteration $t$ of the training procedure according to the equation

$$z^{(t)} = z^{(t-1)} - \alpha \frac{\partial \mathcal{L}(y, \hat{y}(x))}{\partial z},$$

where $z^{(t)}$ denotes the value of $z$ after iteration $t$ and $\alpha$ denotes a learning rate, see Section 5.5.4 and Section 5.4. Note that we must first compute all such derivatives and then simultaneously update all $z$.
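The remark on simultaneous updates can be illustrated on a toy loss that is unrelated to Skip-Gram, namely $\mathcal{L}(z) = z_0^2 + z_0 z_1$, chosen by us only because its derivatives are easy to write by hand. All derivatives are evaluated at the old parameter values before any parameter is changed.

```python
def gradient_step(params, grads, learning_rate):
    """Return the new parameters z - alpha * dL/dz without modifying the inputs."""
    return [z - learning_rate * g for z, g in zip(params, grads)]


def toy_gradients(params):
    """Partial derivatives of the toy loss L(z) = z0**2 + z0 * z1."""
    z0, z1 = params
    return [2 * z0 + z1, z0]


params = [1.0, 2.0]
grads = toy_gradients(params)     # both derivatives use the OLD values
params = gradient_step(params, grads, learning_rate=0.1)
```

Updating $z_0$ first and then evaluating the derivative for $z_1$ at the new $z_0$ would give a different (and unintended) step.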

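Let $z = (w_i)_j$ denote the $j$th entry of the $i$th column of $W$ (also commonly known as $W_{ji}$) and consider the derivative of the following mapping from equation (5.11),

$$(w_i)_j \mapsto \mathcal{L}(y, \hat{y}(x)) = -y_O \log \hat{y}_O(x) = \log\Big(\sum_k \exp(w'_k w_I)\Big) - w'_O w_I,$$

for a fixed pair of input/output vectors $x = e_I$ and $y = e_O$ from the sampling methodology in Section 5.5.1, where $y_O$ and $\hat{y}_O(x)$ denote the $O$th entries of the probability vectors. We may split this into the cases $i = I$ and $i \ne I$, and obtain

$$\frac{\partial \mathcal{L}}{\partial (w_i)_j} =
\begin{cases}
\dfrac{\sum_k (w'_k)_j \exp(w'_k w_I)}{\sum_m \exp(w'_m w_I)} - (w'_O)_j, & \text{if } i = I, \\[1ex]
0, & \text{if } i \ne I.
\end{cases}$$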

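Observe that this only updates $w_I$, the $I$th column of $W$, which corresponds to the input item. Written as a vector, the above gradient is simply

$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial w_I}
&= \sum_k \left[ \frac{\exp(w'_k w_I)}{\sum_m \exp(w'_m w_I)} \right] w'_k - w'_O \\
&= \underbrace{\left( \frac{\exp(w'_O w_I)}{\sum_m \exp(w'_m w_I)} - 1 \right)}_{\in (-1, 0)} w'_O + \sum_{k \ne O} \left[ \frac{\exp(w'_k w_I)}{\sum_m \exp(w'_m w_I)} \right] w'_k, \\
\frac{\partial \mathcal{L}}{\partial w_{\ne I}} &= 0,
\end{aligned} \tag{5.13}$$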

which results in the update

$$w_I^{(t)} = w_I^{(t-1)} - \alpha \left[ \sum_{k \ne O} \frac{\exp(w'_k w_I)}{\sum_m \exp(w'_m w_I)}\, w'_k + \left( \frac{\exp(w'_O w_I)}{\sum_m \exp(w'_m w_I)} - 1 \right) w'_O \right]. \tag{5.14}$$

Note that this update adds a bit of $w'_O$ to $w_I$ while subtracting a bit of each $w'_k$ with $k \ne O$. The effect is that the inner product $\langle w'_O, w_I \rangle$ is increased while it is decreased for $w'_k$, $k \ne O$. For the probability estimate $\hat{y}$, this results in an increased $\hat{y}_O$ and a decreased $\hat{y}_k$ for $k \ne O$. This makes intuitive sense: we observe the pair $(x, y)$, and the update increases the estimated probability of observing that pair.
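The gradient in (5.13) and one step of the update (5.14) can be sketched as below, using the compact form $\sum_k \hat{y}_k w'_k - w'_O$. The toy weights, the learning rate and the function names are hypothetical. A finite-difference comparison against the loss (5.11) is a simple way to check such a derivation.

```python
import math


def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def loss(W_prime, w_I, O):
    """Loss (5.11) for one pair, written as -log y_hat_O."""
    return -math.log(softmax([dot(row, w_I) for row in W_prime])[O])


def grad_w_I(W_prime, w_I, O):
    """Gradient (5.13): sum_k y_hat_k * w'_k - w'_O."""
    y_hat = softmax([dot(row, w_I) for row in W_prime])
    grad = [0.0] * len(w_I)
    for k, row in enumerate(W_prime):
        coeff = y_hat[k] - (1.0 if k == O else 0.0)
        for j, value in enumerate(row):
            grad[j] += coeff * value
    return grad


def update_w_I(W_prime, w_I, O, alpha):
    """One step of update (5.14) on the input vector w_I."""
    grad = grad_w_I(W_prime, w_I, O)
    return [w - alpha * g for w, g in zip(w_I, grad)]


# Hypothetical toy values: |I| = 3 items, D = 2.
W_prime = [[0.2, 0.1], [-0.3, 0.4], [0.0, -0.5]]
w_I, O, alpha = [0.4, -0.1], 0, 0.1
w_I_new = update_w_I(W_prime, w_I, O, alpha)
```

With these values the loss of the observed pair is smaller after the step, in line with the interpretation given above.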

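Similarly, let $z = (w'_i)_j$ (also known as $W'_{ij}$). We consider the mapping

$$(w'_i)_j \mapsto \mathcal{L}(y, \hat{y}(x)) = -y_O \log \hat{y}_O(x) = \log\Big(\sum_k \exp(w'_k w_I)\Big) - w'_O w_I,$$

and wish to compute its derivative with respect to $(w'_i)_j$. Once again, we may split into the cases $i = O$ and $i \ne O$, and obtain

$$\frac{\partial \mathcal{L}}{\partial (w'_i)_j} = (w_I)_j \left( \frac{\exp(w'_i w_I)}{\sum_k \exp(w'_k w_I)} - \mathbb{1}_{\{i = O\}} \right).$$

This results in the following parameter update

$$(w'_i)^{(t)}_j = (w'_i)^{(t-1)}_j - \alpha\, (w_I)_j \left( \frac{\exp(w'_i w_I)}{\sum_k \exp(w'_k w_I)} - \mathbb{1}_{\{i = O\}} \right).$$

Written as a vector, this corresponds to

$$w_i'^{(t)} = w_i'^{(t-1)} - \alpha\, w_I \left( \frac{\exp(w'_i w_I)}{\sum_k \exp(w'_k w_I)} - \mathbb{1}_{\{i = O\}} \right). \tag{5.15}$$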

Observe that this always updates all of $W'$. The update adds a bit of $w_I$ to $w'_O$ and subtracts a bit of $w_I$ from each $w'_k$ with $k \ne O$. This increases the inner product $\langle w_I, w'_O \rangle$ while decreasing $\langle w_I, w'_k \rangle$ for $k \ne O$. Thus the estimated probability $\hat{y}_O$ is once again increased and $\hat{y}_k$, $k \ne O$, is decreased.
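A sketch of this row-wise update is given below, under the assumption that row $i$ of the gradient is $(\hat{y}_i - \mathbb{1}_{\{i = O\}})\, w_I$, which is what differentiating the loss (5.11) with respect to $w'_i$ gives. The toy weights and names are hypothetical. The inner products before and after the step show the effect described in the text.

```python
import math


def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def grad_W_prime(W_prime, w_I, O):
    """Row i of the gradient is (y_hat_i - 1{i = O}) * w_I."""
    y_hat = softmax([dot(row, w_I) for row in W_prime])
    return [[(y_hat[i] - (1.0 if i == O else 0.0)) * w for w in w_I]
            for i in range(len(W_prime))]


def update_W_prime(W_prime, w_I, O, alpha):
    """One gradient step on every row of W', returned as a new matrix."""
    grad = grad_W_prime(W_prime, w_I, O)
    return [[w - alpha * g for w, g in zip(row, grad_row)]
            for row, grad_row in zip(W_prime, grad)]


# Hypothetical toy values: |I| = 3 items, D = 2.
W_prime = [[0.2, 0.1], [-0.3, 0.4], [0.0, -0.5]]
w_I, O, alpha = [0.4, -0.1], 0, 0.1
before = [dot(row, w_I) for row in W_prime]
W_prime_new = update_W_prime(W_prime, w_I, O, alpha)
after = [dot(row, w_I) for row in W_prime_new]
```

Unlike the update of $W$, every row of $W'$ changes at each step, which is the cost that Negative Sampling is designed to reduce.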

5.5.4 Skip-Gram hyper-parameters

Skip-Gram contains a multitude of hyper-parameters, all of which affect the cost function. Hence, to obtain a good result, it is important to study each of them and how they affect both the cost function and the interpretation of the model. Overall, we have the following hyper-parameters, with their effects described briefly in parentheses:

1. Embedding dimension (affects the dimension of the embeddings).

2. Negative sampling (from [15], both if used and the amount of negative samples).

3. Window size for the context (affects the possible set of pairings).

4. Window sampling distribution (affects the frequency of pairings).

5. Cut-off level (removes rare entries and homogenizes the data, but at the cost of throwing away/masking rare events).

6. Cut-off technique (replace the entry by a standard token in the sequence or completely remove the entry from the sequence).

7. Noise distribution for Negative Sampling (affects the frequency with which items are used as negative samples).

Furthermore, the initialization of the weight matrices is not treated as a hyper-parameter that affects the obtained representations, but a suitable initialization may nonetheless speed up the convergence of the algorithm.
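For bookkeeping, the seven hyper-parameters above can be collected in a single configuration object. The sketch below is only one possible layout: the class name, the field names and all values except the window size of 10 are placeholders chosen by us, not settings reported in this text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkipGramConfig:
    embedding_dim: int           # 1. embedding dimension D
    num_negative_samples: int    # 2. 0 means Negative Sampling is not used
    window_size: int             # 3. window size for the context
    window_weights: tuple        # 4. sampling distribution over window offsets
    cutoff_count: int            # 5. items seen fewer times are treated as rare
    cutoff_technique: str        # 6. "token" (replace) or "remove"
    noise_distribution: str      # 7. label of the noise distribution

    def __post_init__(self):
        if len(self.window_weights) != 2 * self.window_size:
            raise ValueError("need one weight per offset in the window")
        if abs(sum(self.window_weights) - 1.0) > 1e-9:
            raise ValueError("window_weights must sum to 1")
        if self.cutoff_technique not in ("token", "remove"):
            raise ValueError("cutoff_technique must be 'token' or 'remove'")


config = SkipGramConfig(
    embedding_dim=32,                       # placeholder value
    num_negative_samples=5,                 # placeholder value
    window_size=10,                         # window size used in the text
    window_weights=tuple([1 / 20] * 20),    # placeholder: uniform over offsets
    cutoff_count=5,                         # placeholder value
    cutoff_technique="token",
    noise_distribution="log_uniform",
)
```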

5.5.5 Negative Sampling

Negative Sampling was introduced in [15] as a computational optimization for Skip-Gram. The paper introduces Negative Sampling as the procedure performed by optimizing the following cost function,

$$\log \sigma(w'_O w_I) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-w'_i w_I) \right], \tag{5.16}$$

where $\sigma(x) = 1/(1 + \exp(-x))$, $k$ denotes the number of negative samples, $P_n(w)$ denotes a noise distribution on all items and $w'_i$ is the vector representation of an item $i$ sampled from the noise distribution $P_n$. The goal of equation (5.16) is to train the model to distinguish between the actual output and the noise distribution, as described in the original methodology paper [7] on noise-contrastive estimation, of which Negative Sampling is a special case. It describes the training task as learning to distinguish between the target word (item) $w_O$ and the noise samples $(w_i)_{1 \le i \le k}$. The paper [6] describes this change in the cost function in further detail and observes that Negative Sampling modifies the original training objective in equation (5.11) by introducing a different objective than the original Skip-Gram.
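For a single training pair, the objective (5.16) can be evaluated as sketched below. Two assumptions are made here: the expectation is replaced by one sampled item per negative, and the negatives are represented by rows of $W'$. The function names and toy vectors are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(x):
    """Numerically stable log(sigma(x)) with sigma(x) = 1 / (1 + exp(-x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def negative_sampling_objective(w_I, w_out_O, negative_vectors):
    """log sigma(w'_O . w_I) + sum_i log sigma(-w'_i . w_I) for drawn negatives."""
    value = log_sigmoid(dot(w_out_O, w_I))
    for w_neg in negative_vectors:
        value += log_sigmoid(-dot(w_neg, w_I))
    return value


w_I = [0.4, -0.1]                          # input vector
w_out_O = [0.2, 0.1]                       # output vector of the true context
negatives = [[-0.3, 0.4], [0.0, -0.5]]     # output vectors of k = 2 negatives
objective = negative_sampling_objective(w_I, w_out_O, negatives)
```

Only $k + 1$ rows of $W'$ enter this expression, compared to all $|I|$ rows in the softmax of equation (5.10).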

Define the set $\mathcal{D}$ of all item and context pairs $(w, w_c)$ that occur in the text by

$$\mathcal{D} := \{ (w, w_c) \mid (w, w_c) \text{ is a possible pair from the pairing generator} \}.$$

Another way of formulating equation (5.16) is to introduce a random variable $Z$ which indicates whether the pair $(w, w_c)$ is the true output ($Z = 1$) or a sample from the noise distribution ($Z = 0$). In other words, we seek the optimal set of parameters $\theta = (W, W')$ such that the following joint probability is maximized,

$$\operatorname*{argmax}_{\theta}\; P(Z = 1 \mid w, w_c) \prod_{i=1}^{k} P(Z = 0 \mid w, w_i),$$

which for all samples results in optimizing

$$\operatorname*{argmax}_{\theta} \prod_{(w, w_c) \in \mathcal{D}} \left[ P(Z = 1 \mid w, w_c) \prod_{i=1}^{k} P(Z = 0 \mid w, w_i) \right].$$

Optimizing this is the same as optimizing its logarithm, which results in

$$\operatorname*{argmax}_{\theta} \sum_{(w, w_c) \in \mathcal{D}} \left[ \log P(Z = 1 \mid w, w_c) + \sum_{i=1}^{k} \log P(Z = 0 \mid w, w_i) \right].$$
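To see how this formulation leads back to equation (5.16), one can model the indicator with a logistic function of the inner product. This parametrization is a common choice and is stated here as an assumption, since it is not spelled out above. With $w_I$ the input vector of $w$, and $w'_c$, $w'_i$ the rows of $W'$ for $w_c$ and $w_i$, each summand becomes:

```latex
\begin{aligned}
P(Z = 1 \mid w, w_c) &= \sigma(w'_c w_I), \\
P(Z = 0 \mid w, w_i) &= 1 - \sigma(w'_i w_I) = \sigma(-w'_i w_I), \\
\log P(Z = 1 \mid w, w_c) + \sum_{i=1}^{k} \log P(Z = 0 \mid w, w_i)
  &= \log \sigma(w'_c w_I) + \sum_{i=1}^{k} \log \sigma(-w'_i w_I).
\end{aligned}
```

The second line uses the identity $1 - \sigma(x) = \sigma(-x)$. The right-hand side of the last line is equation (5.16) with the expectation replaced by the $k$ drawn samples and $w_c$ in the role of the output item $O$.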







The negative samples $(w_i)_{1 \le i \le k}$ are drawn according to a noise distribution $P_n$. Most commonly, this distribution utilizes the frequency of item $i$, denoted $f(i)$ and defined as

$$f(i) = \frac{\operatorname{count}(i)}{\sum_{q=1}^{p} n_q},$$

where $n_q$ denotes the length of sequence $s_q$, and $\operatorname{count}(i)$ denotes the number of times item $i$ occurs across all sequences. Note that this is a distribution across $i$ and thus could potentially be used as a noise distribution. The authors of [15] suggest a modified version of this, defined as

$$P_n(i) = \frac{f(i)^{3/4}}{\sum_{j=1}^{|I|} f(j)^{3/4}},$$

and thus sample the $k$ negative samples according to $P_n$. In this case, we utilize the observation from natural language processing that the occurrence frequency of words approximately follows a log-uniform distribution (see [19] for some description and history of the name), given by

$$f(i) \sim \exp\big(U(0, \log(|I|))\big),$$

where $U(a, b)$ denotes the uniform distribution on $[a, b]$. We draw $k$ negative samples using this noise distribution for the above equation (5.16). We did not test different choices of noise distribution, although recent experimental evidence [2] suggests that other choices of noise distribution may be preferable for non-natural language processing tasks.
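Both noise distributions can be sketched with the standard library. The example sequences and function names are hypothetical. How a continuous log-uniform draw is mapped to an item (for instance through a frequency rank) is not specified above, so the last function only produces the raw draw.

```python
import math
import random
from collections import Counter


def unigram_noise_distribution(sequences, power=0.75):
    """Return {item: P_n(item)} with P_n(i) proportional to f(i) ** power."""
    counts = Counter(item for seq in sequences for item in seq)
    total = sum(len(seq) for seq in sequences)   # sum of the sequence lengths
    weights = {item: (c / total) ** power for item, c in counts.items()}
    norm = sum(weights.values())
    return {item: w / norm for item, w in weights.items()}


def sample_negatives(noise, k, rng):
    """Draw k items (with replacement) according to the noise distribution."""
    items = list(noise)
    return rng.choices(items, weights=[noise[i] for i in items], k=k)


def sample_log_uniform(size, rng):
    """Draw exp(U(0, log(size))), a value between 1 and size."""
    return math.exp(rng.uniform(0.0, math.log(size)))


sequences = [["a", "b", "a"], ["a", "c"]]    # hypothetical item sequences
noise = unigram_noise_distribution(sequences)
negatives = sample_negatives(noise, k=3, rng=random.Random(0))
draw = sample_log_uniform(100, random.Random(0))
```

The exponent $3/4$ flattens the distribution: in this toy example item `a` has frequency $0.6$ but a noise probability below $0.6$, so rarer items are drawn as negatives somewhat more often than their raw frequency suggests.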