Improving Multi Word Expression extraction

3 Proposed improvements to LocalMaxs

As mentioned previously, although the LocalMaxs algorithm extracts Relevant Expressions with decent Precision, its results are not impressive. It is also restricted to extracting n-grams with n greater than 1, and is therefore unable to retrieve single words. The algorithm decides whether to extract an expression based on its cohesion score, which represents the sense of ’glue’ between the words of the expression.

We propose to expand the model, taking advantage of this cohesion metric while adding other important factors when attributing relevance to a phrase. In this chapter we present some additions to LocalMaxs that aim to improve its Precision without damaging its Recall when collecting MWE, and to enable it to also gather meaningful single words.

3.1 Improving Multi Word Expression extraction

3.1.1 LocalMaxs application for candidates extraction

To process a digital sequence of terms, the text must be formatted so that the algorithm knows where each word begins and ends. Hence, the first stage consists of applying basic formatting to the full corpus: every word is separated by a blank space from commas, periods, parentheses and other non-letter characters, and all letters are converted to lower case, as shown in Table 3.1.

Table 3.1: Basic char formatting

RAW TEXT He wasn’t, they thought, after them.

FORMATTED TEXT he wasn ’ t , they thought , after them .

This insertion of the space character does not change the semantics of the text, while enabling a more accurate count of the true term occurrences. For example, in the text ’John and Mary ate some food. John, who was hungry, ate too much.’, the word ’John’ would only be counted once (the second occurrence would be read as ’John,’), even though it should be counted twice, as happens in ’John and Mary ate some food . John , who was hungry , ate too much .’
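The formatting stage can be illustrated with a minimal Python sketch. The function name `basic_format` and the choice of regular expression are assumptions made for illustration; in particular, digits and underscores are treated here as word characters, which the text does not specify.

```python
import re

def basic_format(text: str) -> str:
    """Lower-case the text and isolate every non-word, non-space character."""
    # Assumption: anything that is not a letter, digit, underscore or
    # whitespace is surrounded by blanks so it becomes a separate token.
    spaced = re.sub(r"([^\w\s])", r" \1 ", text.lower())
    # Collapse the runs of blanks introduced by the substitution.
    return " ".join(spaced.split())

print(basic_format("He wasn’t, they thought, after them."))
```

Applied to the raw sentence of Table 3.1, this reproduces the formatted row, and the word ’john’ in the second example is counted twice once the text is split on blanks.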

Now that all words are delimited by blank spaces, the program goes through the document and saves all 1-grams in one dictionary and all n-grams (with n going from 2 to 7) in another, having the term/expression as the key and its absolute frequency as one of the values. The one-word and multi-word dictionaries are referred to below as D1 and D2, respectively.
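A minimal sketch of this counting step is given below. The function name `count_ngrams` and the use of word tuples as keys are illustrative assumptions; only the absolute frequency is stored here, whereas the D2 structure of this work keeps further values per n-gram.

```python
from collections import Counter

def count_ngrams(tokens, max_n=7):
    """Return (d1, d2): frequencies of 1-grams and of n-grams with 2 <= n <= max_n."""
    d1 = Counter(tokens)
    d2 = Counter(
        tuple(tokens[i:i + n])               # n consecutive tokens starting at i
        for n in range(2, max_n + 1)
        for i in range(len(tokens) - n + 1)
    )
    return d1, d2

d1, d2 = count_ngrams("a b a b c".split())
print(d1["a"], d2[("a", "b")])
```

Note that, in this sketch, n-grams are allowed to span punctuation tokens; such expressions are only discarded later, by the forbidden-character filter.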

Next, the cohesion of all n-grams present in D2 is calculated using the SCP, Dice, MI and φ² cohesion metrics. Note that the algorithm works with just one of these metrics at a time, although the results obtained with each of them will be used for evaluation. This calculation is followed by the computation of the Ω_{n−1} and Ω_{n+1} values for every n-gram, as explained in Subsection 2.3.11.
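The cohesion formulas themselves are defined in Chapter 2 and are not restated here. As an illustration only, the sketch below assumes the usual generalised form of SCP, in which the squared probability of the n-gram is divided by the average of the products of the probabilities of its two parts over all split points. The function name `scp`, the tuple-keyed frequency dictionary and the estimate of probabilities as frequency divided by the number of tokens are assumptions of this sketch.

```python
def scp(gram, freq, n_tokens):
    """Generalised SCP of an n-gram (n >= 2).

    freq maps word tuples (including 1-tuples) to absolute frequencies;
    n_tokens is the corpus size used to turn frequencies into probabilities.
    """
    def p(g):
        return freq[g] / n_tokens

    n = len(gram)
    # Average, over the n-1 split points, of p(left part) * p(right part).
    avp = sum(p(gram[:i]) * p(gram[i:]) for i in range(1, n)) / (n - 1)
    return p(gram) ** 2 / avp

# Toy frequencies for the token sequence "a b a b c" (5 tokens).
freq = {("a",): 2, ("b",): 2, ("c",): 1,
        ("a", "b"): 2, ("b", "c"): 1, ("a", "b", "c"): 1}
print(scp(("a", "b"), freq, 5), scp(("a", "b", "c"), freq, 5))
```

In the toy data, ’a’ is always followed by ’b’, so the 2-gram reaches the maximum cohesion, while the 3-gram scores lower.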

D2 is a more complex structure. For each n-gram, it holds two lists of values that describe it: the first keeps its absolute frequency, its cohesion value and the maximum cohesion value of all the corresponding (n+1)-grams; the second has two values, which count how many (n+1)-grams extend the expression in question at each of its two extremities.

For an expression to be a candidate, these values from the second list must be greater than 1. This criterion improves the Precision of the extraction.

In order to eliminate typographical errors, we opted to disregard expressions that present an absolute frequency of 1. This does not seem to delete expressions that are relevant in the document, as these, generally, appear more than once. In fact, this criterion proved to be more beneficial for the Precision measure than detrimental for the Recall.
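The two filters just described can be sketched as follows. This is one possible reading, stated as an assumption: the two values of the second list are taken to be the numbers of distinct (n+1)-grams that extend the expression on the left and on the right, and both are required to be greater than 1. The names `extension_counts` and `passes_filters` are hypothetical, and the cohesion-based LocalMaxs test itself is not shown.

```python
from collections import defaultdict

def extension_counts(ngram_freq):
    """For each n-gram, count the distinct (n+1)-grams extending it on each side."""
    left, right = defaultdict(set), defaultdict(set)
    for gram in ngram_freq:
        if len(gram) >= 2:
            right[gram[:-1]].add(gram[-1])   # gram extends gram[:-1] to the right
            left[gram[1:]].add(gram[0])      # gram extends gram[1:] to the left
    return {g: (len(left.get(g, ())), len(right.get(g, ()))) for g in ngram_freq}

def passes_filters(gram, ngram_freq, ext):
    """Frequency above 1 and more than one distinct extension on both sides."""
    n_left, n_right = ext[gram]
    return ngram_freq[gram] > 1 and n_left > 1 and n_right > 1

# Hypothetical frequencies, for illustration only.
freq = {("new", "york"): 4, ("in", "new", "york"): 2, ("to", "new", "york"): 2,
        ("new", "york", "city"): 2, ("new", "york", "times"): 2, ("york", "city"): 2}
ext = extension_counts(freq)
print(passes_filters(("new", "york"), freq, ext))
```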


3.1.1.1 Forbidden characters

Before the next step, a simple filtering must take place to eliminate obviously unwelcome expressions. After extracting the candidates, some expressions may still contain forbidden characters, such as commas, parentheses and hyphens, to name a few. In this filtering step, the algorithm runs over all extracted expressions and eliminates every one containing these characters, as no expression containing them is relevant.
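A sketch of this filter is shown below. The exact set of forbidden characters is not listed in the text beyond commas, parentheses and hyphens, so the `FORBIDDEN` set used here is an assumption, as are the function names.

```python
# Assumed set: the text only names commas, parentheses and hyphens explicitly.
FORBIDDEN = set(",.;:!?()[]{}\"'’-")

def has_forbidden_char(expression):
    """True if any word of the expression contains a forbidden character."""
    return any(ch in FORBIDDEN for word in expression for ch in word)

def drop_forbidden(expressions):
    """Keep only the expressions free of forbidden characters."""
    return [e for e in expressions if not has_forbidden_char(e)]

candidates = [("noise", "pollution"), ("food", ".", "john"), ("well", "-", "known")]
print(drop_forbidden(candidates))
```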

3.1.2 Automatic identification of Stop-words – The Stop-word List

This step consists of trimming the list retrieved in the previous phase. After the extraction, the list of MWE is still filled with unimportant and meaningless words. These candidates are therefore pre-processed: those that do not respect the conditions explained below are discarded. To achieve this, we developed a technique that enables our program to figure out the most prevalent function words without resorting to a dictionary or part-of-speech tags, and without even knowing which language we are dealing with. We called this technique "Context Analysis".

3.1.2.1 Word context analysis – Finding the threshold

Every language needs function words to connect and articulate its writing and speech. Articles, prepositions and some adverbs are examples of such words, which tend to bear very little meaning and are irrelevant for short summaries and for conveying ideas in few words.

As mentioned before, in [3] the author attempts to gather such words using absolute frequency, claiming that the most common words in a text are of little significance. Intuitively, this seems rather accurate, as empty words such as ’the’, ’in’ or ’a’ appear very insistently in documents.

However, obvious mistakes may come from this approach, as important words can easily end up in this list, and the author has to specify how many of those frequent words to gather. He always captures 201 words, whatever the corpus size, which is a dangerous commitment. In fact, the number of words captured should vary according to the corpus size.

For example, in a domain-specific document, it is only natural that important domain related terms emerge repeatedly, and to base their exclusion solely on the frequency of their occurrence seems rather rudimentary and error prone.

Instead, we propose a more elaborate mechanism to spot unimportant words, regardless of the language of the corpus. This approach tries to capture the very reason why these so-called ’function words’ bear very little importance. They all have something in common: they appear many times, scattered throughout the texts, and, as they merely connect or link phrases, it is only logical that they share no important connection with their surrounding words.

Every word has at least one neighbour, and almost always two, one to the right and one to the left. For example, in ’John went to school’, the word ’went’ has two neighbours: ’John’ and ’to’. Our approach collects how many distinct neighbours each word has.

It is intuitive that the more distinct neighbours a word has, the less tied to an idea it is: it is always hopping around, bringing no useful meaning to its surroundings. In other words, stop-words tend to have no preference for specific neighbours.

Words that show a large number of distinct neighbours are regarded as function words, which are uninformative or weakly informative and merely act as connectors between terms (e.g. ’and’ in ’John and Paul’, ’of’ in ’President of Portugal’ or ’in’ in ’Taxes in Europe’).
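The neighbour count can be sketched in a few lines of Python. Two points are assumptions of this sketch rather than statements of the text: left and right neighbours are pooled into a single set per word, and every token (including punctuation tokens, if present) counts as a neighbour. The function name `distinct_neighbours` is hypothetical.

```python
from collections import defaultdict

def distinct_neighbours(tokens):
    """Map each word to its number of distinct adjacent words (left or right)."""
    neigh = defaultdict(set)
    for left, right in zip(tokens, tokens[1:]):
        neigh[left].add(right)   # 'right' is the right-hand neighbour of 'left'
        neigh[right].add(left)   # and 'left' is the left-hand neighbour of 'right'
    return {word: len(others) for word, others in neigh.items()}

counts = distinct_neighbours("john went to school and john went to work".split())
print(counts["to"], counts["went"])
```

In this toy sentence the connector ’to’ already shows more distinct neighbours than ’went’, which is the behaviour the technique exploits.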

In order to select the right number of function words from this technique, we wanted to find a way to organically find a threshold of separation, instead of extracting a fixed number of words.

Finding the threshold – Firstly, the process begins by sorting all the words in the text in increasing order of their number of distinct neighbours. Then, we plot a chart with the index of each word on the x axis and its number of neighbours on the y axis, as shown below in Figure 3.1. In other words, let neigh(x) be the number of distinct neighbours of the word indexed by x; then neigh(x_j) ≥ neigh(x_i) for x_j > x_i.

Figure 3.1: Words represented according to their number of distinct neighbours

At first glance, the chart looks like two separate lines, one almost horizontal and one almost vertical. But on closer inspection, we can see that it starts to grow very fast, resembling an exponential function, which can be easily perceived in Figure 3.2.


Figure 3.2: Amplified distinct neighbours

The objective is to capture the part of the chart where the number of neighbours soars, a region we called the elbow, thus retaining the ’function words’ and building the Stop-word list. This was achieved by calculating consecutive gradient values in order to discover where the most abrupt changes in the derivative dy/dx happened. Since we do not know the function that fits this curve, we cannot calculate the exact derivative at each point. Nonetheless, we may approximate these values by using a small ∆_x instead of an infinitesimal dx, with ∆_x being a jump that was set to 2.

For example, the values of x are scanned from the rightmost x to the leftmost one, always jumping ∆_x and measuring the corresponding ∆_y. The higher the ∆_y/∆_x value, the higher the slope of the tangent, and we want to capture the spot where the tangent varies the most. Consequently, we calculate the ratio between each pair of consecutive tangents obtained previously, and keep the index at which this ratio is maximized, thus finding the threshold. In other words, the threshold th is obtained by

\[
th = \left[ \underset{n}{\operatorname{argmax}} \left( \frac{t_n}{t_{n-1}} \right) \times \Delta_x \right] \tag{3.1}
\]

where

\[
t_n = \frac{\Delta y_n}{\Delta x_n} \tag{3.2}
\]

and

\[
\Delta y_n =
\begin{cases}
neigh(x_{n \ast \Delta_x}) - neigh(x_{(n-1) \ast \Delta_x}) & \text{for } x < x_{max} \\
neigh(x_{max}) - neigh(x_{max} - \Delta_x) & \text{for } x = x_{max}
\end{cases} \tag{3.3}
\]

Thus, t_n stands for the n-th tangent value. It is important to mention that, by setting different ∆_x values, the point of the curve where the elbow is found varies, and it can be used to privilege the Precision or the Recall of the Stop-word list. We decided to give more importance to Precision, and ended up with a ∆_x value of 2. This criterion enables a better quality of future extracted Relevant Expressions, as will be explained later in this work. By applying this technique, we find the point that attempts to separate the function words from non-function words, as seen in Figure 3.3:

Figure 3.3: Capturing the elbow

Thus, we define the Stop-words list as

\[
StopWords = \{ w \mid index(w) \geq th \} \tag{3.4}
\]

So, the StopWords set contains all the words w whose index, obtained from the ascending order of the number of neighbours of w, is equal to or greater than the threshold defined by Equation (3.1).
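The threshold search of Equations (3.1) to (3.4) can be sketched as follows, under several stated assumptions. The curve is sampled at positions n·∆_x counted from the left (the maximum ratio does not depend on the scan direction), the separate x_max case of Equation (3.3) is not handled (positions beyond the last multiple of ∆_x are ignored), and ratios whose denominator t_{n−1} is zero are skipped, since the text does not say how they are treated. The function names are hypothetical and the neighbour counts are invented for illustration.

```python
def elbow_threshold(y_sorted, dx=2):
    """Index th of Eq. (3.1) for a list of values sorted in increasing order."""
    n_max = (len(y_sorted) - 1) // dx
    # t[n] approximates the slope between positions (n-1)*dx and n*dx.
    t = [None] + [
        (y_sorted[n * dx] - y_sorted[(n - 1) * dx]) / dx
        for n in range(1, n_max + 1)
    ]
    best_n, best_ratio = None, float("-inf")
    for n in range(2, n_max + 1):
        if t[n - 1] == 0:          # assumption: undefined ratios are skipped
            continue
        ratio = t[n] / t[n - 1]
        if ratio > best_ratio:
            best_n, best_ratio = n, ratio
    return None if best_n is None else best_n * dx

def stop_words(neigh, dx=2):
    """Words whose rank, by ascending neighbour count, is at or beyond th (Eq. 3.4)."""
    ranked = sorted(neigh, key=neigh.get)
    th = elbow_threshold([neigh[w] for w in ranked], dx)
    return set() if th is None else set(ranked[th:])

# Invented neighbour counts: ten ordinary words and three connector-like words.
values = [1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 30, 60, 90]
neigh = {f"w{i}": v for i, v in enumerate(values)}
print(elbow_threshold(values), sorted(stop_words(neigh)))
```

With these values the slope jumps from 0.5 to 12.5 between the fourth and fifth samples, so the threshold falls at index 10 and the three words with 30, 60 and 90 neighbours form the Stop-word list.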

From this threshold onward, we find terms that are not especially glued to any specific words, hinting that they have no character, that is, they belong to no context. This also means that we cannot infer the meaning of these words from their surroundings, as they present no contextual preference. Table 3.2 displays a number of words captured by this method.

This way, the most unimportant words are captured with very low error, as will be shown in Chapter 4, and the only measure used was the number of distinct neighbours of words, needing no predefined threshold or semantic information whatsoever. This was one of the objectives of this dissertation.

As far as my research went, I could not find another algorithm or technique able to extract these irrelevant words without prior semantic information or without resorting to an empirical/manual procedure.


Table 3.2: Examples of captured function words

Portuguese   English   German
e            in        zwei
a            also      ins
de           that      ich
que          which     nun
com          from      kann
para         with      dass

Furthermore, by using this technique to build the Stop-word list, we realise that there is a statistical pattern that explains why those words bear no meaning. This enables us to obtain these words relying merely on statistics, regardless of the language and without needing to know which one it is.

3.1.3 Using the Stop-word list to improve LocalMaxs

We noticed that, after applying the LocalMaxs algorithm to a corpus, many of the expressions selected as relevant had function words as delimiters, which should not happen. This is one of the main handicaps of the original algorithm, and it penalizes the Precision measure. By focusing on this issue, we aim to mitigate this shortcoming.

As such, the extremities of the extracted expressions were restricted by a new condition stating that the first and last (leftmost and rightmost) words of an expression must not be present in the previously obtained Stop-word list. Table 3.3 shows some not so relevant expressions with function terms at the extremities, extracted prior to the enforcement of this restriction.

Table 3.3: Expressions with weak extremities

EXPRESSION                       Weak term(s) / Stop-word(s)
how long                         how
did you tell them that           did / that
for once in my life              for
she is driving me bananas        she
will the defendant please rise   will

Consequently, every expression that starts or ends with a term found in the Stop-word list computed in Step 2 is removed from the Relevant Expressions list, as it should not be there.
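This restriction amounts to a simple test on the first and last word of each expression, as the sketch below illustrates. The function name `trim_by_stop_words` is hypothetical, and the Stop-word set used in the example is a hand-picked illustration built from the weak terms of Table 3.3, not an output of the method.

```python
def trim_by_stop_words(expressions, stop_words):
    """Drop expressions whose first or last word is in the Stop-word list."""
    return [
        expr for expr in expressions
        if expr[0] not in stop_words and expr[-1] not in stop_words
    ]

stop = {"how", "did", "that", "for", "she", "will", "and"}   # illustrative only
exprs = [("how", "long"), ("noise", "pollution"), ("banking", "and", "insurance")]
print(trim_by_stop_words(exprs, stop))
```

Note that a function word in the interior of an expression, such as ’and’ in ’banking and insurance’, does not cause its removal; only the extremities are checked.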

After this last step, we end up with expressions that neither begin nor end with the function words found. We know that the Stop-word list does not capture all unimportant connective words, but it presents close to 100% Precision, meaning that almost everything it captures is in fact a function word. Thus, the candidate expressions that were deleted were in fact irrelevant expressions. This is important because this filter causes no decrease in the Recall of the original LocalMaxs algorithm, while improving its Precision.

Figure 3.4 shows the full extraction process of the Multi Word Expressions. Phases 1 and 2 correspond to Step 1 mentioned above, which stands for the original LocalMaxs algorithm. Phase 3 includes the improvements proposed in this dissertation for the multi-word extraction part, comprising the automatic compilation of the Stop-word list (Step 2) and the filter that applies it to trim the previously extracted expressions (Step 3). Thus, Phase 3 produces the improved final list of Relevant Expressions, fulfilling one of the objectives of the dissertation.

Figure 3.4: Summary of the Relevant Expressions extraction process

Table 3.4 shows some examples of Relevant Expressions extracted by the improved algorithm. We can recognize that most items in the produced list are relevant in the three languages. In Section 3.3 we formally define the Improved LocalMaxs, where the new extraction of MWE is included. In Chapter 4 we analyse in detail the relevance of the items extracted by the algorithm.


Table 3.4: Examples of expressions extracted

English                 Portuguese                                 German
noise pollution         todos esses sinais                         möglichen auswirkungen
claire moore            entrar em colapso                          test veranlasst
stan laurel             objectivo sensibilizar                     öffentliche verkehrsmittel
banking and insurance   clientes do bpp                            warren buffett
guiding light           explorar as potencialidades                redaktion vorliegt
practice of law         representantes do ministério da economia   junge aufgezogen
papers were published   todos estes elementos                      pandemie reagiert

3.2 1-Gram Extraction

In any corpus, the proportion of relevant single words (1-grams) among all relevant terms is too large to be disregarded. In this section we propose a mechanism to retrieve relevant 1-grams from a text using the LocalMaxs algorithm. To extend LocalMaxs so that it can extract relevant 1-grams, we take advantage of the Relevant Expressions it provides, together with a complementary technique that will be detailed below.

The extraction in the context of the original LocalMaxs is based on the concept of cohesiveness/glue between consecutive words. Consequently, when the focus is the extraction of single words, the approach cannot be the same, since there is no glue assigned to an individual term.

The most popular method to measure the importance of single words in documents is the TF-IDF metric. This led us to consider it as a possible tool to help in the extraction of relevant single words, combined with other techniques. However, this idea was discarded because we did not want to be limited to analysing corpora with multiple documents, which this metric needs in order to return fair results, as can be seen in Equation (2.1). Consequently, the extraction relies solely on the Relevant Expressions extracted by LocalMaxs and on a technique based on another context analysis (further discussed below).

This extraction is obtained in two consecutive steps, detailed in Subsections 3.2.1 and 3.2.2, respectively:

• Taking advantage of the Relevant Expressions extracted through LocalMaxs, a list of candidate single words is compiled;

• Building another elbow-shaped chart, this time using not only the number of distinct neighbours of each word, but also its length.

3.2.1 Initial filtering

The presence of a word in the Relevant Expressions extracted by LocalMaxs was determined to be an important factor to bear in mind when selecting a document’s relevant words. Most of the words at the extremities of the MWE extracted through the LocalMaxs algorithm are of extreme importance and relevance. Furthermore, due to the nature of the algorithm, all relevant single words tend to appear either on the left or on the right of some Relevant Expression. Remember that both delimiters were subject to context analysis (distinct neighbour count) in the process of improving the quality of Relevant Expressions.

So, in this first step we select all the extremities of the gathered Relevant Expressions and compile a list containing all these words. This results in a list of candidate single words that will be further processed in the next step.
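As a minimal illustration of this first step, the sketch below collects the first and last word of every Relevant Expression. The function name `candidate_single_words` and the representation of expressions as word tuples are assumptions; a set is used so that repeated words are kept only once.

```python
def candidate_single_words(relevant_expressions):
    """Collect the leftmost and rightmost word of every expression."""
    return {
        word
        for expr in relevant_expressions
        for word in (expr[0], expr[-1])
    }

exprs = [("noise", "pollution"), ("banking", "and", "insurance"), ("practice", "of", "law")]
print(sorted(candidate_single_words(exprs)))
```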

3.2.2 Context analysis for 1-gram extraction

From this moment on, the algorithm only handles the words from the list returned in the previous phase. In this step, the number of distinct neighbours of each word is collected, as was done when improving the Relevant Expressions, together with the word’s length (number of characters). This last component is useful here (unlike when analysing MWE, where there are, naturally, many small words connecting terms), because longer words tend to carry more relevance than shorter ones. For example, ’red’, ’big’ and ’use’ are less relevant than ’agriculture’, ’economy’ or ’business’. Then, for each word, the ratio between its number of neighbours and its length is taken into account.

Thus, let us define the context_unigram(w) ratio as:

\[
context\_unigram(w) = \frac{neighbours(w)}{length(w)} \tag{3.5}
\]

and then let us apply this ratio to all words from a generic set, and select those that are on the left side of the elbow formed, using a criterion that will be detailed below.

After calculating the context_unigram(w) value for every word in the input list, the words are sorted in increasing order of this ratio, as plotted in Figure 3.5.


Figure 3.5: Words represented according to their number of neighbours and size, context_unigram(.) ratio

Again, the objective was to spot the elbow mathematically, as shown in Figure 3.6, using gradients to find the place where words start to show sudden increases in the value of context_unigram(.) (Equation (3.5)). Thus,

\[
th = \left[ \underset{n}{\operatorname{argmax}} \left( \frac{t_n}{t_{n-1}} \right) \times \Delta_x \right] \tag{3.6}
\]

where

\[
t_n = \frac{\Delta y_n}{\Delta x_n} \tag{3.7}
\]

and

\[
\Delta y_n =
\begin{cases}
y(x_{n \ast \Delta_x}) - y(x_{(n-1) \ast \Delta_x}) & \text{for } x < x_{max} \\
y(x_{max}) - y(x_{max} - \Delta_x) & \text{for } x = x_{max}
\end{cases} \tag{3.8}
\]

where y(x) = context_unigram(w(x)) and w(x) returns the word indexed by x in the chart of Figure 3.6.


Figure 3.6: Capturing the elbow according to their number of neighbours and size, context_unigram(.)

Now we define a function returning the set of single words found before (to the left of) the elbow.

\[
rel\_words(W) = \{ w \mid w \in W \wedge index(w) < th \} \tag{3.9}
\]

where index(w) returns the index of word w in the chart of Figure 3.6, that is, its position according to its context_unigram(w) ratio, and th is given by Equation (3.6) and represents the index of the threshold.
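Equations (3.5) and (3.9) can be illustrated with the short sketch below. It assumes that the neighbour counts are already available in a dictionary and that the threshold index th has been computed beforehand from Equation (3.6); here th is simply passed in as an argument. The neighbour counts in the example are invented for illustration.

```python
def context_unigram(word, neighbours):
    """Eq. (3.5): number of distinct neighbours divided by word length."""
    return neighbours[word] / len(word)

def rel_words(words, neighbours, th):
    """Eq. (3.9): words ranked before index th by ascending context_unigram."""
    ranked = sorted(words, key=lambda w: context_unigram(w, neighbours))
    return set(ranked[:th])

# Invented neighbour counts, for illustration only.
neighbours = {"agriculture": 11, "economy": 7, "all": 30, "bad": 24}
print(rel_words(list(neighbours), neighbours, th=2))
```

Long words with few distinct neighbours obtain low ratios and stay to the left of the threshold, while short, promiscuous words such as ’all’ obtain high ratios and are left out.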

As will be explained in Section 3.3, this function is used in the formal definition of the Improved LocalMaxs, in order to obtain the set of relevant single words.

After this process, we are left with a shorter list of 1-grams, rid of terms with little to no meaning. This way we were able to eliminate words that, in spite of being at the edges of MWE, bore no relevance or were only relevant within a phrase. In the latter case, I am mostly referring to small words that are relevant as the beginning or end of expressions, but alone reveal very little information (e.g. ’all’ (as in ’all stars’), ’FC’ (as in ’FC Bayern’), ’bad’ (as in ’bad boys’)).

During the implementation, we also tried to use the absolute frequency of the word in formula (3.5), multiplying the neighbours(w) value in the numerator. This attempt brought many errors, because words that were very frequent but had few distinct neighbours, which are words with plenty of relevance, ended up with a very high context value, which led to their deletion. As an example, the word ’Donald’ tends to have two predominant neighbours (’president’ on the left and ’Trump’ on the right) and occurs very frequently, while still being very relevant. This fact led me to decide to use only the neighbours(w) and length(w) factors.

Figure 3.7 summarizes the whole process of extracting relevant 1-grams.

From the document: Enhancements on Multiword Extraction and Inclusion of Relevant Single Words on LocalMaxs (pages 36-49)