Enhancements on Multiword Extraction and Inclusion of Relevant Single Words on LocalMaxs


DEPARTMENT OF COMPUTER SCIENCE

TOMÁS MARIA FERNANDES FARRAJOTA DE MORAIS ALVES

Bachelor in Computer Science and Business Administration

ENHANCEMENTS ON MULTIWORD EXTRACTION AND INCLUSION OF RELEVANT SINGLE WORDS ON LOCALMAXS

MASTER IN COMPUTER SCIENCE

NOVA University Lisbon


Adviser: Joaquim Francisco Ferreira da Silva

Assistant Professor, NOVA University Lisbon


The NOVA School of Science and Technology and the NOVA University Lisbon have the right, perpetual and without geographical boundaries, to file and publish this dissertation through printed copies reproduced on paper or on digital form, or by any other means known or that may be invented, and to disseminate through scientific repositories and admit its copying and distribution for non-commercial, educational or research purposes, as long as credit is given to the author and editor.

Acknowledgements

First and foremost, I would like to thank all my academic circle that contributed to the construction of this dissertation: the University (FCT), the computer science department and all the professors.

A special thanks to my advisor, Joaquim Francisco Ferreira da Silva, who was always available and eager to help with the development of this project. It was really a pleasure and an honor to work with him.

A humble thanks to my entire family, especially my parents, who provided for me and encouraged me to follow a computer science degree, even though this was not my first degree and I knew I would conclude it a bit older than my fellow colleagues. Also to my brother and sister, Gonçalo and Teresa, who listened whenever I, rather enthusiastically, spoke of my thesis. And, of course, to Eulália, my pretend-grandmother and dear friend, who puts up with a lot.

To all my friends, a very big thanks, for they also listened and sometimes even gave tips about this project. A special thanks to my close friends at the University, João Murgeiro, José Almeida, Tomás Pereira and Vasco Castanheira, who came along on this 5-year journey, and whom I was fortunate enough to meet and become close to. They were crucial for the completion of this degree and I appreciate that.

Abstract

The digital information available to us grows at an overwhelming pace. Following advances in Text Mining, this large amount of information can now be processed and understood more swiftly by people. For this purpose, extracting Relevant Expressions and Keywords from a text becomes an important task.

This process consists in retrieving the most important ideas from a document or set of documents, which can be done using statistical and/or linguistic tools, the former being the focus of this work.

In order to extract these terminologies using statistical methodologies, one must take advantage of patterns that indicate importance in a word/expression. Relevant Expressions tend to present some singularities, as the words therein seem to have, for example, high values of cohesion between them, conveying importance.

LocalMaxs is an algorithm that uses this cohesion metric between words to capture meaningful Multi Word Expressions from a text, with an average Precision close to 70%, but it is not able to extract 1-grams (single words). This dissertation aims at improving the performance of this algorithm, as well as extending it to extract Relevant Single Words, which is an important factor especially in languages where relevant compound nouns come as long single words (e.g. German). These improvements must be made while keeping language independence.

Keywords: LocalMaxs, Relevant Expressions, Multi Word Expression, Relevant Single Words

Resumo (Summary)

Information available in digital form grows at a dizzying speed, making it difficult to process and keep up with. Using Text Mining techniques, this large amount of information can be read and understood more quickly by humans. The extraction of Relevant Expressions and Terms is a crucial process for breaking down a document or group of documents, and consists in gathering their most important concepts. This process is carried out using statistical tools (the focus of this work) and/or linguistic tools.

To extract these terminologies using statistical methods, patterns must be found that indicate the importance and relevance of a word/expression. Relevant Expressions have several defining characteristics, one of which is high values of statistical cohesion between the words that compose them.

The LocalMaxs algorithm uses these cohesion values between words to extract Relevant Expressions from a text, with a precision of approximately 70%. It cannot, however, extract relevant 1-grams (single words). This dissertation aims to improve the performance of the LocalMaxs algorithm in extracting Relevant Expressions, and to create mechanisms that allow it to extract relevant 1-grams.

These improvements must keep the algorithm independent of the language of the text under analysis.

Keywords: Relevant Expressions, LocalMaxs, Relevant Words

Contents

List of Figures
List of Tables
Acronyms

1 Introduction
1.1 Description
1.2 Motivation
1.3 Objectives

2 Related Work
2.1 Introduction
2.2 Linguistic extraction
2.2.1 Stop-word list
2.2.2 Part-of-Speech Tagging and Filtering
2.2.3 Lemmatization and Stemming
2.2.4 Noun Chunking
2.2.5 Taxonomy creation and disambiguation
2.2.6 Metaphor identification
2.2.7 Technique for extremely specific documents
2.2.8 Polarity detection
2.3 Statistic extraction
2.3.1 Absolute frequency
2.3.2 TF-IDF 1-gram extraction
2.3.3 TF-DCF
2.3.4 Termhood analysis
2.3.5 Clustering of terms
2.3.6 Co-occurrence metrics
2.3.7 Sequence Expansion extractor
2.3.8 Xtractor
2.3.9 Concept Extractor
2.3.10 YAKE! (Yet Another Keyword Extractor)
2.3.11 LocalMaxs Extractor
2.4 Applications of Relevant Expression extraction

3 Proposed improvements to LocalMaxs
3.1 Improving Multi Word Expression extraction
3.1.1 LocalMaxs application for candidates extraction
3.1.2 Automatic identification of Stop-words – The Stop-word List
3.1.3 Using the Stop-word list to improve LocalMaxs
3.2 1-Gram Extraction
3.2.1 Initial filtering
3.2.2 Context analysis for 1-gram extraction
3.3 Improved LocalMaxs
3.4 Towards the evaluation

4 Results
4.1 Methodology
4.2 Evaluation
4.2.1 Results on Multi Word Expression extraction
4.2.2 Results on newly implemented Relevant Single Word extraction
4.3 Conclusions on the results achieved

5 Conclusions
5.1 Improvements on the Relevant Expressions extractor
5.2 Implementation of single words extractor
5.3 Future work

Bibliography

List of Figures

2.1 Concept taxonomy example
2.2 Example of neighbourhood glue values in the context of the LocalMaxs algorithm
2.3 Example of Document Classification
3.1 Words represented according to their number of distinct neighbours
3.2 Amplified distinct neighbours
3.3 Capturing the elbow
3.4 Summary of the Relevant Expressions extraction process
3.5 Words represented according to their number of neighbours and size, context_unigram(.) ratio
3.6 Capturing the elbow according to their number of neighbours and size, context_unigram(.)
3.7 Summary of the Relevant 1-gram extraction process

List of Tables

2.1 Examples of Relevant and Non-Relevant Expressions
2.2 Morphosyntactic patterns of terms
2.3 Common words for a Stop-word list
2.4 Part-of-Speech Tagging Example
2.5 Lemmatization and Stemming examples
2.6 Most frequent one and two grams in English
2.7 Expected tf-idf score in domain specific documents
2.8 Some results of cohesion metrics of MI and Dice
2.9 Examples of Relevant and Non-Relevant Expressions
3.1 Basic char formatting
3.2 Examples of captured function words
3.3 Expressions with weak extremities
3.4 Examples of expressions extracted
3.5 Examples of 1-grams extracted
4.1 Grading of expressions
4.2 1-grams classification examples
4.3 Precision of original LocalMaxs expression extraction
4.4 Precision regarding Multi Word Expressions (MWE) extraction from the Improved LocalMaxs
4.5 Recall of original LocalMaxs expression extraction
4.6 Recall regarding MWE extraction from the Improved LocalMaxs
4.7 F-score of original LocalMaxs expression extraction
4.8 F-score regarding MWE extraction from the Improved LocalMaxs
4.9 Example set of MWEs extracted by the Improved LocalMaxs and its classification
4.10 Precision regarding Relevant Single Words (RSW) extraction
4.11 Recall regarding RSW extraction
4.12 F-score regarding RSW extraction
4.13 Extracted RSW and its classification

Acronyms

MI Mutual Information
MWE Multi Word Expressions
NLP Natural Language Processing
RSW Relevant Single Words
SCP Symmetric Conditional Probability

1 Introduction

’I think you can have a ridiculously enormous and complex data set, but if you have the right tools and methodology, then it’s not a problem.’

Aaron Koblin, entrepreneur in data and digital technologies

Text mining is a technique that aims to obtain high-value information from digital texts through the extraction of terms or phrases. Term or expression extraction using Natural Language Processing consists in analysing a corpus, or set of documents, and culminates in the retrieval of expressions that, ideally, convey relevant information about the text. Throughout a corpus, important sentences and groups of words tend to follow a set of rules, and the aim of an extractor is to capture these terminologies. These expressions obey certain patterns, whether syntactic or statistical.

1.1 Description

Analysing a document for extraction purposes is a complex and elaborate task and is, in most cases, not very efficient performance-wise. There are two ways to approach the term or phrase extraction process: linguistically and statistically. The first relies on idiomatic and semantic relations between terms, and depends on preexisting morphological and syntactic information about the text in question, which does not always have the desired quality. The other approach takes advantage of the statistical patterns of occurrence and co-occurrence of words, which are, in general, independent of the language in question. When comparing the results of both, linguistic methods tend to perform better in Precision and Recall, with the drawback of being language dependent and requiring lexical and morphosyntactic background information.

In this work, although both strategies will be described, the focus is on statistics-based extraction of relevant expressions and words. In this dissertation, I want to contribute to improving the statistical extraction of the LocalMaxs algorithm, taking into account the potential of the language-independent side of this type of terminology retrieval.

1.2 Motivation

In the past 30 years, multiple ways of summarising and retrieving the key ideas from unstructured digital text have emerged. In an era where data is the greatest and most important resource, dealing effectively with continuously growing amounts of information is becoming a crucial but strenuous and demanding task. Sometimes it is practically impossible for humans to deal with such large quantities of data in a reasonable time frame.

Here is where the extraction of relevant expressions comes into play, retrieving the main ideas and enabling the user to absorb the information in a much quicker way.

Upon the extraction of relevant terms and expressions, one is able to capture the most important concepts a document tries to convey. In this academic work, statistical measures are privileged over linguistic ones, and it is remarkable that mere positional and cohesion relations between terms translate into a meaningful deconstruction of a piece of digital data, regardless of its language.

Term extraction has several practical applications. The most direct one is simply retrieving keywords from documents, helping the user to grasp the most important ideas in the text. Another common use is the indexation and classification of documents. Document indexation is a technique that associates a document or a set of documents with one or more terms or expressions, making it easy to query for texts. Categorizing documents is important for entities that have to deal with large amounts of information. Consequently, grouping texts of the same domain is key, and is attained through supervised or unsupervised document classification, which uses the relevant expressions of a text as features. On the one hand, if unsupervised, the algorithms focus on clustering documents that are alike in content. Once a user requests a document, this process eases access to other related texts, even though they have no assigned type. On the other hand, when the method is supervised, the objective is to attribute a type or category to each document, based on the features mentioned above, making document querying much more efficient.

This indexation and classification of documents also makes the inference of implicit relevant expressions possible, which consists in attributing a given expression to a text in which it does not occur. Imagine 10 documents of the same domain, ’Politics’, where all but one contain the word ’politics’, which is considered a relevant expression feature. If a user were to search for documents using the keyword ’politics’, all ten may be retrieved, even though the word is not present in one of them. The potential of integrating these methods in search engines is clear and very important nowadays.


1.3 Objectives

As far as retrieving relevant expressions goes, a great number of extractors have been developed. My interests tilt towards the LocalMaxs algorithm, which focuses on evaluating statistical patterns between terms to extract relevant multi-word expressions. One of the most important characteristics of an extractor is its Precision, as its reliability depends on it. The LocalMaxs extractor presents an unimpressive Precision score and cannot extract single words. I have worked with this algorithm before, and believe it has the potential to score much higher in Precision once some improvements are applied. Hence, my proposal is to improve its Precision without lowering its Recall, as well as to make it capable of extracting Relevant Single Words (RSW). The objective is to maintain Precision regardless of the language, to show that the algorithm is, in fact, language independent.

As for Multi Word Expressions (MWE), I intend to perform their extraction using various cohesion metrics separately, such as the Symmetric Conditional Probability (SCP), Dice, Mutual Information (MI) and φ², and then evaluate the results and keep the one that performs best in Precision and Recall.
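To make the idea of a cohesion metric concrete, the following is a minimal sketch of three of the metrics named above, restricted to 2-grams. It assumes the commonly used bigram forms, SCP(x, y) = p(x, y)² / (p(x) p(y)), Dice(x, y) = 2 f(x, y) / (f(x) + f(y)) and MI(x, y) = log(p(x, y) / (p(x) p(y))), with all probabilities estimated as frequency divided by the number of tokens. The function name `bigram_cohesion` and the toy sentence are illustrative and do not come from this dissertation; φ² and the generalisation to longer n-grams are left out.

```python
import math
from collections import Counter


def bigram_cohesion(tokens):
    """Return SCP, Dice and MI scores for every 2-gram in a token list.

    Assumption: probabilities are estimated as frequency / number of tokens.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (x, y), f_xy in bigrams.items():
        p_x, p_y, p_xy = unigrams[x] / n, unigrams[y] / n, f_xy / n
        scores[(x, y)] = {
            "scp": p_xy ** 2 / (p_x * p_y),
            "dice": 2 * f_xy / (unigrams[x] + unigrams[y]),
            "mi": math.log(p_xy / (p_x * p_y)),
        }
    return scores


tokens = "new york is big and new york is old".split()
scores = bigram_cohesion(tokens)
for bigram in [("new", "york"), ("is", "big")]:
    print(bigram, scores[bigram])
```

In this toy text, 'new' and 'york' always occur together, so their SCP and Dice values are higher than those of 'is big', whereas MI gives both pairs the same score. This kind of disagreement between metrics is one reason for evaluating each of them separately.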

In text mining, the corpus (set of documents) under study should have at least 1 or 2 million words, so that the statistical robustness of the results can be guaranteed.

Most algorithms for terminology extraction use empirically tested thresholds. For instance, the typical Relevant Expression tends to start with medium to large sized words, and fixing the minimum number of letters of this word improves the extraction significantly. However, this procedure is not desirable, as it is not natural and is prone to create unwanted dependencies. In this work, an important goal was to achieve the desired results relying on a minimum number of empirical thresholds.

The main goal of this dissertation was to improve the LocalMaxs algorithm in contiguous multiword extraction, so that this technique may be used, in a more precise way, by whoever wants to take advantage of it, and to enable it to capture RSW.

2 Related Work

2.1 Introduction

Today’s computer-based and World Wide Web dependent society brings about an ever-growing amount of digital information and, consequently, the need for tools to handle and filter this data so that it is easier to digest. The need to summarize information drives the creation of such tools, which extract the most relevant Multi Word Expressions (MWE) from a set of documents (corpus) by using enhanced statistical and/or linguistic methods of Natural Language Processing (NLP). By doing so, key-phrases are extracted to facilitate the handling of the data. The process begins by extracting all the terms and/or phrases from the document and ends with a scoring mechanism. This whole procedure aims at computing the semantic relevance of lexical units or expressions, selecting the ones that score best.

Table 2.1: Examples of Relevant and Non-Relevant Expressions

EXPRESSION RELEVANT NON-RELEVANT
The United States of America
United States of America
Boris Johnson
Boris Johnson explained
Remainings of dinosaurs

The linguistic approach builds on a conceptualization of the document, attempting, as far as possible, to extract meaning from terms. Many techniques are applied, an essential one being the use of Part-Of-Speech taggers, which attach to every extracted term its morphological value (noun, adjective, adverb, verb, etc.). The most relevant expressions are typically the ones with noun dominance and, more specifically, the ones starting and ending with nouns. A list of terms with no semantic value also enhances the extractor, helping it to disregard phrases which have them as delimiters. Some approaches even try to extract the meaning of a term, using external dictionaries (such as WordNet) to retrieve these expressions. Linguistic approaches are highly dependent on the language of the corpora, as idiomatic standards vary widely amongst languages, and are sometimes supported by statistical inferences.

The statistical approach concerns the finding of patterns in the extracted terms, rather than their linguistic interpretation. Firstly, the absolute frequency of the n-gram in the document is retrieved. Some methods weight the frequency; for example, an occurrence in the title weighs more, and one in a subtitle or index counts less. If dealing with multiple documents from various domains, a comparative frequency calculation is ideal to find domain-specific expressions. Another strategy is to calculate the spatial factor of terms, by calculating the mean and the variance of the distance between the words composing an expression. If the terms occur separately in a random fashion in the text, then the variance is high; otherwise the variance is close to zero, which may indicate relevance. The co-occurrence of terms is yet another important analysis, which measures the statistical cohesion of two or more words, delivering the glue factor between them, and is a step in determining the relevance of an expression. On top of these methods, algorithms may be used to extract meaningful information in a more accurate fashion. Here we propose an enhancement to the LocalMaxs algorithm, especially to improve its Precision when retrieving Relevant Expressions, also known as MWE.
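The spatial factor described above can be illustrated with a short sketch. It assumes a simple definition that is not taken from any specific extractor: for every occurrence of the first word, the distance to the nearest following occurrence of the second word is recorded, provided it falls within a fixed window. The names `pair_distances`, `spatial_stats` and the `window` parameter are hypothetical.

```python
from statistics import mean, pvariance


def pair_distances(tokens, first, second, window=5):
    """Distances from each `first` to the nearest following `second` within `window` tokens."""
    distances = []
    for i, token in enumerate(tokens):
        if token != first:
            continue
        for offset in range(1, window + 1):
            j = i + offset
            if j < len(tokens) and tokens[j] == second:
                distances.append(offset)
                break
    return distances


def spatial_stats(tokens, first, second, window=5):
    """Return (mean, variance) of the distances, or None if the pair never co-occurs."""
    distances = pair_distances(tokens, first, second, window)
    if not distances:
        return None
    return mean(distances), pvariance(distances)


text = ("the prime minister spoke and the prime minister left "
        "while prime time suits the minister")
print(spatial_stats(text.split(), "prime", "minister"))
```

A pair that always appears at the same distance yields a variance of zero, while scattered co-occurrences, such as the third 'prime ... minister' in the toy sentence, push the variance up.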

Typically, language-dependent strategies show more Precision in capturing Relevant Expressions, at the cost of depending on much more preprocessed information than the statistical ones. Hybrid algorithms are seemingly more complete and have even better results, and eliminate, when possible, the shortcomings of the approaches mentioned above.

For human beings, the detection of irony, metaphors or sarcasm in texts is considerably easy, but for machines the task becomes much more complex, as they lack emotion. Sentiment Analysis attempts to change this, capturing, for instance, polarity in texts, and trying to attach intention to words. Aligned with linguistic terminology processing, Sentiment Analysis is the ultimate achievement, and consists in extracting true motive and feeling from expressions, capturing the emotions therein. As mentioned in [10]:

’In a nutshell, sentiment analysis or opinion mining aims to identify positive and negative opinions or sentiments expressed in text as well as the targets of these opinions (...).’

In order for this to be achieved, the processing algorithm must not be misguided by ambiguous terms, metaphors, sarcasm and all idiosyncratic characteristics inherited by a language.

NLP techniques have evolved over the years, with the purpose of improving their Precision in capturing meaningful concepts from digital text. This helps us to have easier and faster access to information, enabling us to get the gist of digital textual information by only reading a handful of expressions.


To automatically extract Relevant Expressions or phrases, a computer must be provided with tools and mechanisms if it is to be able to attribute relevance to these terms. As mentioned in Chapter 1, the strategies for extraction of semantically meaningful phrases vary from linguistic to statistical, and both approaches are discussed separately below.

2.2 Linguistic extraction

Teaching and making a computer understand the semantic relevance of a phrase is a highly complex task; even humans sometimes struggle to comprehend their own language. As languages can differ significantly in many aspects, language identification is truly a must when performing Natural Language Processing. An example of a text language identifier is the US patent by Johannes Heinecke [8], where he says:

’Tools for automatically processing natural language [. . . ] use data sets characterizing only one language at a time, such as a lexicon of basic lexical forms constituting dictionary or lexicon entries, morphological rules and grammatical rules, for only one language at a time.’

The problem thickens if the document is multi-lingual, reinforcing the importance of language identification. Morphological and phrase construction patterns are not universal, as demonstrated in Table 2.2, which is an obstacle when dealing with linguistic extraction of Multi Word Expressions (MWE).

Table 2.2: Morphosyntactic patterns of terms

PHRASE | SYNTACTIC ORDER | LANGUAGE
United States of America | Adj, Noun, Prep, Noun | English
Estados Unidos da America | Noun, Adj, Prep, Noun | Portuguese
Red Zone | Adj, Noun | English
Zona Vermelha | Noun, Adj | Portuguese
Tall man | Adj, Noun | English
Homem alto | Noun, Adj | Portuguese

To perform a linguistic analysis over textual digital data, a fair number of tools are used to process the information, and in the following sub-sections we mention the strategies most commonly used to accomplish it.

Syntax is of major importance when extracting expressions, especially when taking advantage of part-of-speech tags (noun, verb, adv, etc.). Taking the first example of Table 2.2, it is clear that a change of language alters the syntactic order of an expression, showing the importance of knowing the language of the document, as every language has its own morphosyntactic rules.


For further document processing, every word, or token, is extracted, and sentence/phrase boundaries are defined. Next we present language-dependent tools used to extract meaningful terms or expressions from a text.

2.2.1 Stop-word list

Every language has a relatively small list of words named function words, which tend to be the most frequent in a text. For example, in English, these words include articles and prepositions (e.g. the, in, onto), conjunctions (e.g. as, if), pronouns (e.g. he, she) and some adverbs (e.g. only, always), to name a few. Function words are very common in texts and are known to contribute little to no semantic value or relevance to an expression. They are very useful in text and in speech, but carry little importance in keyword extraction.

Table 2.3: Common words for a Stop-word list

PORTUGUESE | ENGLISH
em | to
por | with
quem | onto
ela | from
sempre | while

Language-dependent algorithms for the extraction of relevant expressions rely on a list of these terms to improve their Precision. One function of this Stop-word list is to prevent the algorithm from retrieving expressions beginning or ending with these function words, as they would not be relevant.
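The boundary rule just described can be sketched in a few lines. The stop-word set below is a small illustrative sample, not a complete list, and the function name `has_clean_boundaries` is hypothetical.

```python
# Illustrative sample only; a real Stop-word list would be much longer.
STOP_WORDS = {"the", "of", "in", "to", "with", "onto", "from", "while"}


def has_clean_boundaries(expression, stop_words=STOP_WORDS):
    """True when the expression neither starts nor ends with a stop word."""
    words = expression.lower().split()
    return bool(words) and words[0] not in stop_words and words[-1] not in stop_words


candidates = [
    "United States of America",
    "The United States of America",
    "States of",
]
kept = [c for c in candidates if has_clean_boundaries(c)]
print(kept)
```

Note that a stop word in the middle of an expression, such as 'of' in 'United States of America', does not cause rejection; only the first and last words are checked.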

2.2.2 Part-of-Speech Tagging and Filtering

POS tagging is the cornerstone of any linguistic extractor. It consists in attaching to every retrieved term its morphosyntactic value (noun, adjective, adverb, verb, etc.), as seen in Table 2.4. Once the text is tagged, a POS filter may be applied to select sentences, accepting as terms noun sequences containing optional adjectives, prepositions, verbs and so on. This last action reduces the size of the extracted data considerably.

Table 2.4: Part-of-Speech Tagging Example

PHRASE | Phrase after POS Parsing
John Doe went home | John/Noun, Doe/Noun, went/Verb, home/Noun
The games are starting | The/Det, games/Noun, are/Verb, starting/Verb
Flying is for birds | Flying/Verb, is/Verb, for/Prep, birds/Noun

Tagging may seem a straightforward task, but when dealing with homographs (words with the same spelling) the process becomes more error prone. Taking the following sentence as an example:

’John houses two friends, Paul and George, whose houses were destroyed in a fire’,

it is clear that the first occurrence of ’houses’ is a verb and the second is a noun. This demonstrates the importance of context in tagging and shows that this is a difficult task for a tagger.

2.2.3 Lemmatization and Stemming

Both Lemmatization and Stemming reduce a word to its root or canonical form, enabling better pattern recognition along the text. On the one hand, Lemmatization strips every inflected word of its suffixes and prefixes (e.g. ’goes’, ’friends’ and ’played’ are reduced to ’go’, ’friend’ and ’play’, respectively), to reduce unneeded diversity of related words. On the other, there is Stemming. One of the most widely used English stemmers is still Porter’s Stemmer [15], even though it was originally developed in 1979. For example, the words argue, argues and argued are all stemmed to argu.

Table 2.5: Lemmatization and Stemming examples

WORDS | Stemming | Lemmatization
prison/prisoner/imprisoning | pris | prison
television/televise/telecom | tel/tele | tele
house/housing/housers | hous | house

These two text normalization techniques differ. Lemmatization always reduces a term to an existing word, with the help of a dictionary (e.g. WordNet), whereas Stemming is faster but does not always produce real words.
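The contrast between the two techniques can be sketched with a deliberately crude example. The suffix stripper below is not Porter's algorithm, and the lemma dictionary is a tiny hypothetical stand-in for a resource such as WordNet; both are assumptions made only for illustration.

```python
# Crude, illustrative suffix list (longest first); NOT Porter's rules.
SUFFIXES = ("ing", "ed", "es", "s")

# Tiny hypothetical lemma dictionary standing in for a real lexical resource.
LEMMAS = {"goes": "go", "friends": "friend", "played": "play", "went": "go"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


for word in ["housing", "houses", "played", "went"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stemmer needs no dictionary and maps 'housing' and 'houses' to the same non-word 'hous', while the dictionary lookup handles the irregular form 'went' but leaves every word it does not know untouched.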

2.2.4 Noun Chunking

Another tool is Noun Chunking, which consists in capturing, from the tagged text, non-recursive sequences of words that form a group equivalent to a noun (e.g. big car). One relevant chunking scheme may be Adj*-Noun, meaning the extracted sequence is a non-recursive sequence of terms, ending with a noun and having zero or more adjectives at the beginning (e.g. big yellow dog or white house). In [16], the authors achieved an impressive 92% Recall and Precision rate for base-NP (noun phrase) chunks, and their work remains a reference on the matter.
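A chunking scheme of the form "zero or more adjectives followed by a noun" can be sketched over text that has already been tagged. The tag names follow Table 2.4; the hand-tagged input and the function name `noun_chunks` are assumptions made for illustration, and this is not the method of [16].

```python
def noun_chunks(tagged):
    """Collect chunks made of zero or more adjectives followed by one noun.

    `tagged` is a list of (word, tag) pairs, with tags such as "Adj" and "Noun".
    """
    chunks, adjectives = [], []
    for word, tag in tagged:
        if tag == "Adj":
            adjectives.append(word)
        elif tag == "Noun":
            chunks.append(" ".join(adjectives + [word]))
            adjectives = []
        else:
            # Any other tag breaks a pending adjective sequence.
            adjectives = []
    return chunks


tagged_sentence = [
    ("the", "Det"), ("big", "Adj"), ("yellow", "Adj"), ("dog", "Noun"),
    ("chased", "Verb"), ("a", "Det"), ("car", "Noun"),
]
print(noun_chunks(tagged_sentence))
```

The quality of the chunks depends entirely on the tags: a homograph tagged as a verb instead of a noun would silently drop a chunk.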

2.2.5 Taxonomy creation and disambiguation

In order to extract semantic details from terms, a taxonomy of all relevant terms may be created. This means generating a hierarchical tree that indicates connection and co-occurrence between terms. In [13], the authors approach the issue in four phases: term extraction, term filtering, disambiguation and hierarchy creation.

Figure 2.1: Concept taxonomy example

The first phase is trivial: it merely extracts the terms and links them to their Part-Of-Speech tag. Term filtering is used to select the most relevant words for a specific domain, and consists of four steps: Domain Pertinence, Lexical Cohesion, Domain Consensus and Structural Relevance.

The Disambiguation stage means to set apart syntactically similar/equal terms with different meaning and to group semantically identical words (synonyms). Here the algorithm resorts to a dictionary to search for synonyms and determine, given the context, if two seemingly identical words have the same semantic value or not. This step is very helpful for the last stage, the hierarchy creation, where a subsumption method is used through the co-occurrence of concepts in the document, creating a tree-like relationship between terms.

In the end, the resulting taxonomy is compared, by experts, to a certified one, and Precision and other scores are computed.

2.2.6 Metaphor identification

Another approach to retrieving the meaning of terms or groups of terms, given a specific syntactic context, is metaphor identification. In various cases, the meaning of a sentence is not attained through the sum of its parts (words). In fact, phrases sometimes group terms that do not appear together very often. Words are sometimes non-compositional, as the conventional meaning of a term may differ from its contextual value. For instance, one cannot evaluate each term of the expressions ’kick the bucket’, ’He takes after his father’ or ’kiss of death’ separately, as the true meaning is only present when looking at the phrase as a whole.

Taxonomies may lose value if these occurrences are not identified and captured, risking misleading associations of terms, like ’kiss’ and ’death’, from the last example.

In [12], it is claimed that linguistic tagging does not yet take full advantage of metaphor identification. Training the data using Deep Neural Networks, they claim to have achieved state-of-the-art performance in end-to-end metaphor capturing.

2.2.7 Technique for extremely specific documents

Syntactic and morphological analyses are sometimes not robust enough to extract oddly formed domain-specific expressions. In [1], the authors state that standard expression extractors sometimes fail when dealing with very technical terms, which may look unnatural from a strictly linguistic point of view. As a consequence, they combined standard linguistics with methods like exogenous disambiguation, which help the taggers with odd-looking or unknown terms through the use of attested terminology. This study was conducted using texts of the biomedical domain, assisted by a set of external terminology of the same field.

2.2.8 Polarity detection

The final goal of a linguistic-based algorithm is to extract emotion and intent from expressions and phrases. For that, polarity detection is crucial: the machine computes whether a sentence or word is negatively or positively driven. In this technique, words carry polarity tags that indicate whether they convey a positive or negative tone; a sentence or group of sentences is then assigned a sentiment score, from which its polarity is determined.

In [6], the authors conducted polarity analysis, with an impressive F-score of 85%.

Although returning very good results, this technique is very complex and extremely reliant on good, preexisting linguistic information.

2.3 Statistic extraction

Extracting semantic meaning from digital text is a heavy and language-dependent task. Many techniques mentioned in Section 2.2 use some statistical background and references in their models, in order to strengthen their claims.


In Relevant Expressions, individual terms are correlated: patterns can be extracted from their joint occurrences, forming relations that translate into cohesion and correlation between these words.

Intuitive and language independent, statistic facts can provide hints to the discovery of Relevant Expressions. For instance, typically, words with large numbers of syllables tend to be more relevant than others with only one or two. By adding this metric to their repertoire, the authors of [21] were able to improve their extractor’s results. There are many other tools of this kind that will be discussed below.

In this section we mention statistical strategies that aid in attributing relevance to n-grams, or groups of terms. We will go over the main approaches used to assign importance and relevance to expressions through methods that do not depend on semantic information. These methods are merely numerical, built and calculated via statistical observations.

The programs used to extract these Relevant Expressions don’t know the language of the text beforehand; they are designed to need only unstructured text to work with.

2.3.1 Absolute frequency

This is the first step of most statistical Relevant Expression extractors. It consists of counting the frequency of each n-gram (with n varying from 1 to whatever the author sees fit).

Naively, it is fair to assume that the most frequent terms/expressions are to be considered the most relevant. But this is a rash assumption, bearing in mind that, for example, connectors and prepositions appear in abundance throughout every textual form. When only 1-grams and 2-grams are considered, this is even clearer, as seen in Table 2.6.

In fact, the frequency of words can be used to gather just the opposite: the words with the least relevance and meaning. In [3], the author gathers the 201 most frequent words and considers them devoid of meaning, which can present some problems. In fact, figuring out an accurate Stop-word list using only statistical information seems hardly possible, but this will be discussed further in the next section.

Table 2.6: Most frequent one and two grams in English

Rank   1-gram   2-gram
1      the      of the
2      of       in the
3      a        to the
4      and      on the
5      to       and the


A weighted frequency measure may also be applied as a complement. As an example, an occurrence of an n-gram in the title may indicate importance, and one could favour and inflate that count. Conversely, the count of sets of terms that appear in a footer or in small text may be disregarded.

2.3.2 TF-IDF 1-gram extraction

When dealing with multiple documents from various domains, a comparative frequency calculation between texts is ideal to find domain-specific terms. This calculation is often attained using the Term Frequency–Inverse Document Frequency (TF-IDF) operator, which analyzes the occurrences of an expression in a document against the other document(s). TF-IDF has had many formulations over the years, but [22] remains the most robust.

The formula goes as follows:

\[
\text{tf-idf}(t, d, D) = tf(t, d) \cdot \log \frac{\|D\|}{1 + \|\{d \in D : t \in d\}\|} \tag{2.1}
\]

where tf(t, d) represents the total frequency of term t in document d and D the total set of documents in the corpus.

As seen in Equation (2.1), the relevance of term t in document d grows with its frequency in that document and with the logarithm of the inverse of the number of different documents it appears in, meaning that the more widely spread among documents a word is, the less relevance it is predicted to have.

This method is very popular mostly because it casts out words that are very frequent in all documents, hence preventing them from being considered more relevant than they should.
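The behaviour described above can be sketched in a few lines of Python. This is only an illustration of Equation (2.1): the function name `tf_idf`, the natural logarithm and the toy corpus are assumptions made for the example, not part of the cited formulation.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus_tokens):
    """Score `term` in one document following Equation (2.1)."""
    tf = Counter(doc_tokens)[term]  # tf(t, d): raw count of the term in d
    # Number of documents of the corpus that contain the term.
    containing = sum(1 for d in corpus_tokens if term in d)
    return tf * math.log(len(corpus_tokens) / (1 + containing))


# Toy corpus, invented for illustration: three tiny tokenised "documents".
corpus = [
    "the player scored and the player won".split(),
    "the minister visited the country".split(),
    "the mountain is in the country".split(),
]

player_score = tf_idf("player", corpus[0], corpus)  # term confined to one document
the_score = tf_idf("the", corpus[0], corpus)        # term present in every document
```

In this toy corpus both words occur twice in the first document, yet 'player' receives a positive score and 'the' a negative one, because 'the' appears in all three documents and the `1 +` in the denominator pushes the logarithm below zero.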

This method has been popularized for document indexing purposes. As each term has a tf-idf score for each document of the corpus, the most important words of each document can be associated with it, creating a sort of term categorization. As exemplified in Table 2.7, the same word is qualitatively scored for every document as high or low, depending on the domain of the document.

Table 2.7: Expected tf-idf score in domain-specific documents

WORD/DOMAIN   POLITICS   GEOGRAPHY   SPORTS
Player        low        low         high
Country       high       high        low
Heart         low        low         low
Minister      high       low         low
Mountain      low        high        low


The TF-IDF extractor is commonly used for single-term retrieval. Applying it to multiword expressions requires care, because it may retrieve bad results, which remains a major weakness.

For instance, the expression ‘the Americans were’ may score very high in tf-idf if it appears several times in a few documents, even though it is clearly not a relevant phrase. The method also depends on having a decent number of different documents in order to present acceptable results (as it relies on the number of documents in the corpus).

2.3.3 TF-DCF

This technique, developed in [11], is similar to TF-IDF, but seems to penalize more heavily a term that appears in several documents. As in the case of TF-IDF, the TF-DCF value for each term must be computed for each document of the corpus.

\[
\text{tf-dcf}_t^{(c)} = \frac{tf_t^{(c)}}{\prod_{\forall g \in G} \left(1 + \log\left(1 + tf_t^{(g)}\right)\right)} \tag{2.2}
\]

In Equation (2.2), c represents the domain document under analysis, G stands for the set of contrasting documents and g for a single contrasting document. tf_t^(c) and tf_t^(g) stand for the absolute frequency of term t in documents c and g, respectively. TF-DCF will retrieve the relevance of term t for document c.

TF-DCF presents a peculiarity that the previous algorithm doesn’t: TF-IDF only accounts for whether the term is present in other documents, whereas in [11] the authors consider the absolute frequency of its occurrences. On the one hand, this eliminates common words from the final result; on the other, it may over-penalise domain-specific words that also appear frequently outside their speciality field, like the words ‘President’ (Politics), ‘exercise’ (Wellness) or ‘kitchen’ (Gastronomy).
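A minimal sketch of Equation (2.2) may help to see this penalisation at work. The function name `tf_dcf`, the use of the natural logarithm and the three toy documents are assumptions made for illustration only.

```python
import math
from collections import Counter


def tf_dcf(term, domain_tokens, contrasting_docs):
    """Score `term` for the domain document following Equation (2.2)."""
    tf_c = Counter(domain_tokens)[term]
    penalty = 1.0
    for g in contrasting_docs:
        tf_g = Counter(g)[term]            # absolute frequency in contrasting document g
        penalty *= 1 + math.log(1 + tf_g)  # equals 1 when the term is absent from g
    return tf_c / penalty


# Toy documents, invented for illustration.
domain = "the president met the minister and the president spoke".split()
contrasting = [
    "the president of the club resigned".split(),
    "the kitchen and the garden".split(),
]

score_the = tf_dcf("the", domain, contrasting)
score_president = tf_dcf("president", domain, contrasting)
score_minister = tf_dcf("minister", domain, contrasting)
```

Here 'the' is the most frequent word of the domain document but ends up with the lowest score of the three, while 'president' loses part of its score only because it also occurs once in a contrasting document.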

When measuring the behavior of the extractor, the authors registered impressive results, delivering around 92% and 86% Precision for selecting domain-specific relevant 2-grams and 3-grams, respectively.

2.3.4 Termhood analysis

Following the idea of comparing contrasting corpora, the Termhood algorithm (thd) was developed in [9]. It presents itself as a novelty, as it does not use the frequency of terms directly. Instead, the authors propose to use this metric to create a term ranking in the corpus. The values for term relevance are computed as follows:

\[
thd_t^{(c)} = \frac{r_t^{(c)}}{V^{(c)}} - \frac{r_t^{(g)}}{V^{(g)}} \tag{2.3}
\]

where r_t^(c) is the rank of term t in document c, V^(c) is the vocabulary count in c (the number of different terms) and r_t^(g) and V^(g) are the rank of term t and the vocabulary count in the general corpus, respectively. Concerning the rank, the most frequent word has the biggest value, and terms that seldom appear have a low value. The result varies from 1 to -1, reflecting the domain importance of a term in a document.
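The ranking idea can be sketched as follows. Several details are assumptions of this example and are not taken from [9]: ranks run from 1 to V, ties in frequency are broken alphabetically, and a term that is absent from a corpus gets rank 0 there. The names `rank_table` and `termhood` are hypothetical.

```python
from collections import Counter


def rank_table(tokens):
    """Rank terms 1..V by ascending frequency, so the most frequent term gets rank V."""
    counts = Counter(tokens)
    # Ties in frequency are broken alphabetically (assumption of this sketch).
    ordered = sorted(counts, key=lambda t: (counts[t], t))
    return {term: i + 1 for i, term in enumerate(ordered)}, len(ordered)


def termhood(term, domain_tokens, general_tokens):
    """Difference of normalised ranks, as in Equation (2.3)."""
    r_c, v_c = rank_table(domain_tokens)
    r_g, v_g = rank_table(general_tokens)
    # A term that does not occur in a corpus gets rank 0 there (assumption).
    return r_c.get(term, 0) / v_c - r_g.get(term, 0) / v_g


# Toy token lists, invented for illustration.
domain = "gene gene gene mutation mutation the".split()
general = "the the the of of gene".split()

thd_gene = termhood("gene", domain, general)
thd_the = termhood("the", domain, general)
```

With these toy lists 'gene' is ranked highest in the domain document and lowest in the general one, so its termhood is positive, while 'the' shows the opposite pattern and gets a negative value.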

2.3.5 Clustering of terms

By analysing multiple documents with different domains, it is clear that some terms naturally appear more frequently in certain classes of documents. Term indexation via clustering is a viable technique to associate words with certain texts.

As done in [7], terms were extracted using comparative frequency methods (like the ones mentioned above), and then, with the help of likelihood metrics, the authors clustered these terms using k-means. This is a step towards the objective of aggregating words that belong to the same field of study, reaching a Precision of 72%.

2.3.6 Co-occurrence metrics

To decide whether an expression is relevant or not, some find it paramount to calculate the cohesion, or glue, between the words that compose it. The glue represents the degree of cohesion of an expression, and is based on the pattern of co-occurrence of the words that compose it. These cohesion values may be calculated through several metrics: Mutual Information (MI), φ², Dice and log-likelihood, to name a few. Each metric may return a different cohesion value, as they differ in structure.

The Mutual Information (MI) cohesion score, applied to Natural Language Processing (NLP) in [5], computes the cohesion between terms by calculating a glue value which has no limited range.

\[
MI([x, y]) = \ln\left(\frac{p(x, y)}{p(x) \cdot p(y)}\right) \tag{2.4}
\]

By performing a derivation and gradient analysis, it can be concluded that this metric is highly dependent on the probability of the n-gram (p(x, y)), even when the proportions between the frequency of the bigram (x, y) and of the individual unigrams x and y remain the same. In other words, it is sensitive to the document size. This shortcoming may produce wrong assessments of the real glue of an n-gram.


The Dice metric is another co-occurrence cohesion metric, and this one does not depend on the absolute frequency of the n-gram, as long as the proportions between the frequency of the bigram (x, y) and of the individual unigrams x and y remain the same, as can be seen in Equation (2.5).

\[
Dice([x, y]) = \frac{2 \cdot f(x, y)}{f(x) + f(y)} \tag{2.5}
\]

Table 2.8 below shows an example demonstrating some differences in the results obtained when using Dice and MI.

Table 2.8: Some results of the MI and Dice cohesion metrics

Metric   p(x) or f(x)   p(y) or f(y)   p(x,y) or f(x,y)   K    Result
MI       5/N            5/N            3/N                1    11.69
MI       5/N            5/N            3/N                2    11.00
MI       5/N            5/N            3/N                10   9.39
Dice     5              5              3                  1    0.6
Dice     5              5              3                  2    0.6
Dice     5              5              3                  10   0.6

In Table 2.8, N stands for the size of the example corpus. If, in a corpus, the occurrence probabilities p(x, y), p(x) and p(y) were all multiplied by K, the MI would not just reflect the degree of co-occurrence between the words x and y, as it also depends on K, which distorts the cohesion score retrieved. On the contrary, the Dice metric keeps the same result as long as the proportions between f(x), f(y) and f(x, y) remain the same, as expected.
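The effect of K can be checked with a short sketch of Equations (2.4) and (2.5). The corpus size N = 1,000,000 is an assumption: it is not stated in the text, and was chosen here only because it yields MI values close to those of the table. The function names `mi` and `dice` are likewise illustrative.

```python
import math


def mi(p_x, p_y, p_xy):
    """Mutual Information of a bigram, Equation (2.4)."""
    return math.log(p_xy / (p_x * p_y))


def dice(f_x, f_y, f_xy):
    """Dice coefficient of a bigram, Equation (2.5)."""
    return 2 * f_xy / (f_x + f_y)


N = 1_000_000  # assumed corpus size; not given in the text

results = {}
for K in (1, 2, 10):
    # Multiply every probability (or frequency) by the same factor K.
    mi_k = mi(K * 5 / N, K * 5 / N, K * 3 / N)
    dice_k = dice(K * 5, K * 5, K * 3)
    results[K] = (mi_k, dice_k)
    print(f"K={K:>2}  MI={mi_k:.2f}  Dice={dice_k:.1f}")
```

Since the numerator of MI scales with K and its denominator with K², each multiplication by K lowers the MI value by ln K, whereas K cancels out in the Dice ratio.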

The φ² coefficient is yet another cohesion metric. It generally presents the best results when compared to the others mentioned above, and is arguably one of the most used metrics for this purpose. It is a more complex and comprehensive metric and, like Dice, it respects the proportionality of the frequencies. It goes as follows:

\[
\varphi^2([x, y]) = \frac{\left(f(x, y) \cdot N - f(x) \cdot f(y)\right)^2}{f(x) \cdot f(y) \cdot (N - f(x)) \cdot (N - f(y))} \tag{2.6}
\]

where f(x, y), f(x) and f(y) are the joint absolute frequency of x and y, the frequency of word x and the frequency of word y, respectively, and N represents the total number of words in the corpus. The upside of this metric is that it considers N − f(x) and N − f(y), which represent ¬x and ¬y, respectively.
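As a hedged illustration, the coefficient can be written as a small function. The sketch squares the numerator, as in the standard definition of φ², and the name `phi2` and the sample frequencies are assumptions made for the example.

```python
def phi2(f_x, f_y, f_xy, n):
    """Phi-squared glue of a bigram from absolute frequencies and corpus size n."""
    numerator = (f_xy * n - f_x * f_y) ** 2
    denominator = f_x * f_y * (n - f_x) * (n - f_y)
    return numerator / denominator


always_together = phi2(5, 5, 5, 100)   # x and y only ever occur together
independent = phi2(10, 10, 1, 100)     # f(x, y) * N equals f(x) * f(y)
small_corpus = phi2(5, 5, 3, 1000)
large_corpus = phi2(50, 50, 30, 10000)  # same proportions, ten times larger
```

The first two calls give the two extremes of the metric (1 and 0), and the last two return the same value, which is the proportionality property mentioned above.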


But these metrics have one flaw in common, which is their inability to measure the glue of n-grams when n is greater than 2.

Certainly, a large number of relevant expressions are longer than a 2-gram and, in response to that, the authors of [21] came up with a solution, in the context of the proposal of the Symmetric Conditional Probability (SCP) metric.

Compared to Dice, for words x and y, this metric seems to penalize more heavily the disjoint occurrences of x and y (meaning when x occurs and y doesn’t, and vice versa).

For the SCP metric the glue for a 2-gram is computed as follows:

\[
SCP([x, y]) = p(x|y) \cdot p(y|x) = \frac{p(x, y)^2}{p(x) \cdot p(y)} \tag{2.7}
\]

where p(x, y) accounts for the probability of x being followed by y in the document, and p(x) and p(y) represent, respectively, the probability of occurrence of the 1-grams x and y in the text.

To handle longer n-grams, the authors apply a Fair Dispersion Point Normalisation, transforming an n-gram into a ‘pseudo 2-gram’, where the first z terms aggregate into x and the remaining n − z units into y. By applying the following change to Equation (2.7), this technique becomes capable of computing the glue of longer expressions (3-grams or greater):

\[
SCP_f([w_1 \ldots w_n]) = \frac{p(w_1 \ldots w_n)^2}{Avp} \tag{2.8}
\]

where Avp represents:

\[
Avp([w_1 \ldots w_n]) = \frac{1}{n-1} \sum_{i=1}^{n-1} p(w_1 \ldots w_i) \cdot p(w_{i+1} \ldots w_n) \tag{2.9}
\]

Note that the order of the words is relevant for this assessment. The SCP glue value ranges from 0 to 1, the former meaning no cohesion and the latter showing signs of high correlation. It is to be expected that very frequent n-grams, such as ‘in the’, ‘with a’ or ‘but it is’, yield a glue value near zero, as each word in these expressions also appears largely in other contexts.
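The following sketch computes the SCP_f glue over a toy token list. It assumes that the probability of any n-gram is estimated as its absolute frequency divided by the number of tokens; that estimate, the helper names `ngram_counts` and `scp_f`, and the sample sentence are choices made for illustration.

```python
from collections import Counter


def ngram_counts(tokens, max_n):
    """Absolute frequency of every contiguous n-gram with 1 <= n <= max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def scp_f(ngram, counts, total):
    """SCP_f glue of an n-gram (n >= 2), following Equations (2.8) and (2.9)."""
    def p(gram):
        # Probability estimated as frequency / number of tokens (assumption).
        return counts[tuple(gram)] / total

    n = len(ngram)
    # Average over the n - 1 dispersion points that split the n-gram in two.
    avp = sum(p(ngram[:i]) * p(ngram[i:]) for i in range(1, n)) / (n - 1)
    return p(ngram) ** 2 / avp if avp else 0.0


tokens = "new york is big new york is old".split()
counts = ngram_counts(tokens, max_n=3)

glue_new_york = scp_f(("new", "york"), counts, len(tokens))
glue_is_big = scp_f(("is", "big"), counts, len(tokens))
glue_new_york_is = scp_f(("new", "york", "is"), counts, len(tokens))
```

For a 2-gram there is a single dispersion point, so Avp reduces to p(x)·p(y) and the function falls back to Equation (2.7). In this toy text 'new' and 'york' never occur apart, hence a glue of 1, whereas 'is' is followed by 'big' only half of the time.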

It is then possible to adapt the other metrics in a way that makes them able to calculate the cohesion of n-grams where n > 2, by applying the concept of Fair Dispersion Point Normalization, as done with the SCP metric. To do so, in the formulas we substitute all p(x)·p(y) by Avp, described in Equation (2.9), and all p(x) and p(y) by Avx and Avy, presented below in Equation (2.10) and Equation (2.11), respectively.

\[
Avx([w_1 \ldots w_n]) = \frac{1}{n-1} \sum_{i=1}^{n-1} p(w_1 \ldots w_i) \tag{2.10}
\]

\[
Avy([w_1 \ldots w_n]) = \frac{1}{n-1} \sum_{i=2}^{n} p(w_i \ldots w_n) \tag{2.11}
\]
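As an illustration of this substitution, and only as a sketch: writing Equation (2.5) with probabilities instead of absolute frequencies and replacing p(x) and p(y) by Avx and Avy would give a generalised Dice glue. The name Dice_f is an assumption borrowed from the SCP_f notation; it is not defined in the text above.

```latex
\[
Dice_f([w_1 \ldots w_n]) = \frac{2 \cdot p(w_1 \ldots w_n)}{Avx([w_1 \ldots w_n]) + Avy([w_1 \ldots w_n])}
\]
```

For n = 2, Avx and Avy reduce to p(w_1) and p(w_2), so the expression falls back to the 2-gram Dice.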

2.3.7 Sequence Expansion extractor

In [14], the authors extract 2-grams, and then use an ingenious way to expand them into longer n-grams. The algorithm starts by extracting all 2-grams and, from these, selecting candidates through cohesion, using two metrics: Mutual Information (MI) as a filter and Log-Likelihood as a selector. Next, consider, for example, the candidate W, ‘swimming pool’: the algorithm fetches another candidate that overlaps W on either side, finding ‘large swimming’, and then aggregates ‘large’ with ‘swimming pool’, building a pseudo 2-gram in order to compute its cohesion. If this value is greater than the cohesion of W minus K (a defined constant), then the 3-gram ‘large swimming pool’ is added to the final list, and this method continues recursively until a fixed point (the maximum n-gram length).

The extractor is then evaluated in terms of Precision, Recall and perplexity, the latter measuring how well a model produces linguistic content.

By using MI as a filter, this approach will be highly dependent on the size of the documents, which is not desirable, and can lead to bad results.

2.3.8 Xtractor

This method for MWE retrieval differs in a key concept from the others: it extracts collocations, which are composed of words that go together very often, so its aim is not only relevant key-phrases.

Until this point, we have only discussed contiguous sets of words that may or may not be relevant. Xtractor proposes the extraction of contiguous or non-contiguous collocations, as described in [19], and involves three phases.

The first consists of establishing lexical relationships between all 2-grams separated by, at most, five words, based on frequency and on the rigidness of the relative position of the terms.

In the next stage, multiple-word combinations and complex expressions are identified, and in the last one, the 2-grams from the first phase are filtered and scored, using statistical and parsing methods. This algorithm reaches a Precision of 80% in identifying meaningful collocations, with an astounding Recall of 95%. As this algorithm captures non-contiguous collocations, it returns a lot of common expressions, and it is only natural that many of them are not considered Relevant Expressions.

Table 2.9: Examples of Relevant and Non-Relevant Expressions

WORDS COLLOCATION RELEVANT-EXPRESSION

make an effort United States of America

fast food rely on Emily Blunt

2.3.9 Concept Extractor

Concept Extractor, an approach proposed in [20], attempts to calculate the cohesion and, hence, the relevance of terms while also taking into account non-contiguous occurrences. The authors believe that relevant expressions are delimited by concepts, these concepts being meaningful and important terms.

In a nutshell, term/expression relevance is extracted by combining three methods: Fixed distances, Specificity of single word concepts and Specificity of multi word concepts.

The first entails measuring the relative distance between a pair of terms. If this distance is short and consistent, the calculated Relative Variance will be high, hinting that these terms are Concepts.

The second phase determines the specificity of a term: the fewer terms a word relates to, the more specific it is. For instance, one may assume that the word ‘Minotaur’ relates to fewer concepts than the 1-gram ‘House’; hence, the authors assign higher specificity to the former.

The last phase makes use of the terms’ Relative Variance and Specificity to draw conclusions about an MWE, attributing more relevance to expressions whose terms score higher in phases 1 and 2.

This proposed extractor scores very positively in terms of Precision and Recall, having one fragility, which is depending on empirically obtained thresholds to determine whether a term or expression is relevant or not.


2.3.10 YAKE! (Yet Another Keyword Extractor)

Yake!, developed and presented in [4], builds upon statistical features and extracts relevant expressions from a single document. It comprises four phases: text processing, feature extraction, single term weighing and candidate keyword generation.

• Term Extraction: Firstly, the text is split into a set of 1-grams, using blank spaces or punctuation as a delimiter. This initial step creates a list of all the words in the text, that are to be used in the following phases.

• Feature Extraction: The algorithm computes five characteristics over each term, which are to be used to score each one of them. The following heuristics are calculated:

Casing, which takes into account how frequently a term begins with an upper-case letter or is an acronym. This feature increases the relevance of these words over the lower-case ones, as they are seen as more important;

Word Position, which considers that a word’s relevance is related to where in the document it appears (calculated using the median), becoming less important towards the end of the document;

Term Frequency, which counts how many times a term occurs in the text, normalizing this number to avoid bias towards very large documents;

Relatedness to Context, which computes how many different terms co-occur with the candidate word (checking the 3 neighbours on both sides). The more it has, the less relevant it seems to be. This should be intuitive, as relevant terms tend to be found in the same context, surrounded by the same words;

DifSentence, which quantifies how often the candidate term appears in different sentences (also normalized).

• Individual Term Weighing: Aggregates all features into one score, S(w). High values of S(w) translate into word irrelevance. The formulation of this score goes as follows:

\[
S(w) = \frac{W_{Rel} \times W_{Pos}}{W_{Case} + \frac{W_{Freq}}{W_{Rel}} + \frac{W_{DifSent}}{W_{Rel}}} \tag{2.12}
\]

where W_Rel, W_Pos, W_Case, W_Freq and W_DifSent stand for Relatedness to Context, Word Position, Casing, Term Frequency and DifSentence, respectively.


• Candidate Keyword Extraction: After every word is scored, Yake! attempts the extraction of longer n-grams, using the S(w) score of each word to assign relevance to an expression. This step also uses a stop list, preventing the extraction of expressions that start or end with stop words.
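The individual term weighing of Equation (2.12) can be sketched as a single function. The feature values below are invented for illustration and are not produced by the actual Yake! feature extraction; the function name `yake_term_score` is hypothetical.

```python
def yake_term_score(w_rel, w_pos, w_case, w_freq, w_dif_sent):
    """Aggregate the five features into S(w), Equation (2.12); lower means more relevant."""
    return (w_rel * w_pos) / (w_case + w_freq / w_rel + w_dif_sent / w_rel)


# Invented feature values for two hypothetical words.
likely_keyword = yake_term_score(w_rel=1.0, w_pos=2.0, w_case=1.0, w_freq=3.0, w_dif_sent=0.5)
likely_function_word = yake_term_score(w_rel=2.0, w_pos=2.0, w_case=0.0, w_freq=0.5, w_dif_sent=0.5)
```

The second word has a higher Relatedness to Context and lower Casing and Term Frequency values, so it ends up with a much larger S(w), which is read as lower relevance.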

For the evaluation, the authors compared its results against several methods, one being the above-mentioned TF-IDF. This method attained an average Precision of 20%, and performed better than the other methods it was compared against. Bearing in mind that it is a very light algorithm and only uses a single document, the results obtained are adequate. However, it was reported that this algorithm is very sensitive to the document size and performed much worse on non-English documents.

2.3.11 LocalMaxs Extractor

This extractor’s first stage consists of formatting the text, separating every word by a blank space. Then the algorithm saves every n-gram in the corpus in a dictionary, to simplify the process of accessing it. In the dictionary, the key is the n-gram and the value is its absolute frequency. The LocalMaxs extractor then takes an n-gram’s cohesion value, obtained with one of the above-mentioned metrics (SCP, φ², ...), and compares it to those of its neighbours. As stated in [18], this method relies on the local maxima of the glue values for every n-gram (n ≥ 2), consequently avoiding a dependence on experimentally calculated thresholds. However, in order to be relevant, an expression must have an absolute frequency of at least 2, which is an imposed limit that can be improved.

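The algorithm runs as follows:

\[
\forall x \in \Omega_{n-1}(w) \wedge \forall y \in \Omega_{n+1}(w),\ w \text{ is a Relevant Expression if:}
\]

\[
\left(\mathrm{length}(w) = 2 \wedge g(w) > y\right) \vee \left(\mathrm{length}(w) > 2 \wedge g(w) > \frac{x + y}{2}\right) \tag{2.13}
\]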

where g(w) represents the glue value for the expression w. For instance, let us consider the expression w, ‘United States of America’, which is evidently a relevant expression. In this case, Ω_n−1 is the set containing the glue values of ‘United States of’ and ‘States of America’. As for Ω_n+1, it contains the glue values corresponding to all possible expressions composed of w plus one more single word before or after w, such as ‘the United States of America’ or ‘United States of America is’.
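The selection criterion of Equation (2.13) can be sketched as a small predicate. The function name `is_relevant_expression` is hypothetical and the glue values are invented for illustration; they are not measured on any corpus.

```python
def is_relevant_expression(length, glue, sub_glues, super_glues):
    """LocalMaxs test of Equation (2.13).

    sub_glues   -- glue values of the contained (n-1)-grams, i.e. Omega_{n-1}(w)
    super_glues -- glue values of the containing (n+1)-grams, i.e. Omega_{n+1}(w)
    """
    if length == 2:
        # A 2-gram has no (n-1)-gram glue to compare with, only its extensions.
        return all(glue > y for y in super_glues)
    return all(glue > (x + y) / 2 for x in sub_glues for y in super_glues)


# Invented glue values for 'United States of America' and its neighbours.
usa = is_relevant_expression(4, 0.80, sub_glues=[0.55, 0.60], super_glues=[0.30, 0.25])
# Invented glue values for a weak 2-gram whose extensions are more cohesive.
states_of = is_relevant_expression(2, 0.10, sub_glues=[], super_glues=[0.55, 0.40])
```

With these values the 4-gram is a local maximum and is selected, while the 2-gram is rejected because at least one of its extensions has a higher glue. Note that `all` over an empty neighbourhood is vacuously true, so an n-gram with no recorded neighbours would pass this sketch.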

Fig. 2.2 plots the neighbourhood of an n-gram against its SCP value. The expressions ‘energy savings’ and ‘energy savings in the public sector’ are selected as relevant by LocalMaxs, as their vicinity presents lower cohesion values.


Figure 2.2: Example of neighbourhood glue values in the context of LocalMaxs algorithm

As for the scores registered, LocalMaxs scores around 70%, which is good for a purely statistical and threshold-less extractor. But this technique can only extract n-grams for n > 1, so it cannot extract single words.

In [18], the authors also conduct a non-contiguous relevant concept extraction. This process consisted of, after applying a glue metric to an n-gram, stripping a gram from it and then assessing the ‘cohesion’ loss. In the end, after applying LocalMaxs, the highest rated n-grams, in terms of cohesion, were typically those in which the blank space could be filled with various different words.

The local maxima strategy uses a very interesting methodology, but allows the entry of some garbage expressions that bear no meaning and should not be present. By conditioning this entry, we intend to trim the selected Relevant Expressions, leading to better Precision and hopefully the same Recall.

In this work we propose enhancements to LocalMaxs to tackle its unimpressive Precision in contiguous relevant expression extraction and its inability to select relevant 1-gram terms.

2.4 Applications of Relevant Expression extraction

Text mining has shown to be of great use in many fields of human studies. The direct and most obvious application is the extraction of keywords from text, which helps readers to have a small preview of what is displayed.

This procedure also plays an important role in Document Classification, using extracted meaningful expressions or terms as features to group documents that belong to the same domain. In an era where available digital information is growing faster than ever, this process can prevent users from wasting time looking for the main theme of a document, also playing a part in cataloguing them.

A great example is the Lymba classification algorithm, which is used in automatic email processing. An enterprise that receives a great number of emails and struggles to attend to all of them in due time may turn to this software as a filter that, based on the keywords found, archives messages that require no reply and redirects the ones that do to a specific department. Apart from Lymba, there are other business-oriented software tools that are based on relevant keyword extraction, such as IBM Watson, Amazon Comprehend and Aylien, to name a few.

Figure 2.3: Example of Document Classification

Automatic Taxonomy Construction processes have proved to have interesting applications, such as next-word prediction (attempted in [2] with great results), in a ‘fill in the blanks’ fashion, which is very useful in search engines. Knowing the relations between terms, upon the insertion of a word or expression by a user, engines can try to predict, in a decently accurate way, the next word or words, making the query faster for the user.

To improve the results retrieved by the query, Document Indexation turns out to be helpful, linking and associating terms and expressions, explicit or not, to documents, making this search faster and more efficient.

The medical field also benefits a lot from the progress in text processing, due to, for example, the increasing tendency of people turning to the World Wide Web to search for health related issues. In [17], the authors utilize the correlation and semantic connection between terms to associate Health concepts, Risk Factors and Symptoms to Diseases.

Great progress is being made in the Genetic Field, using extracted Terminology to detect genetic variants and mutations, as shown in [23].

In social media, extractors are used to moderate comment sections, basically evaluating the polarity of expressions and words in comments and acting when needed. This technique may help the website delete offensive comments and even ban users from the platform.

Automatic Chat Boxes are commonly used in websites to process frequently asked questions and provide answers to users, ensuring faster customer service and a lower employee count. The text inserted in the chat boxes is processed, looking for keywords that may help the virtual assistant become aware of what issues are troubling the user.

The world is swarmed with information; data reproduces at a speed that we humans can’t follow efficiently. Further applications of text mining with a focus on terminology retrieval will surely emerge sooner rather than later. Not only new applications, but also improvements on the ones already in existence.


3 Proposed improvements to LocalMaxs

As mentioned previously, even though the LocalMaxs algorithm extracts Relevant Expressions with decent Precision, it is not impressive. It is restricted to extracting n-grams with n greater than 1, and is thus unable to retrieve single words. This algorithm uses the cohesion score of an expression to extract it. This cohesion represents the sense of ‘glue’ between the words of an expression.

We propose to expand the model, taking advantage of this cohesion metric but also adding other important factors when attributing relevance to a phrase. In this chapter we present some additions to LocalMaxs, aiming to improve its Precision without damaging its Recall in collecting MWEs, and to enable it to also gather single meaningful words.

3.1 Improving Multi Word Expression extraction

In order to improve this algorithm, we propose a number of enhancements, based on changing the criteria for considering an expression relevant or not. The new LocalMaxs relevant expression extractor is divided into three steps:

• Step 1: The extraction of relevant n-gram candidates from the unstructured text, resorting to the unaltered LocalMaxs algorithm, which uses word cohesion to hint at importance and meaning;

• Step 2: The construction of a Stop-word list, which consists of function words that bear little to no meaning, using merely statistical inferences;

• Step 3: The filtering of the expressions extracted in step 1, using the results obtained in step 2.


3.1.1 LocalMaxs application for candidates extraction

To process a digital sequence of terms, the text must be formatted in a way that lets the algorithm know where a word begins and where it ends. Hence, the first stage consists of applying basic formatting over the full corpus: every word is separated by a blank space from commas, periods, parentheses and other non-letter characters, and all letters are converted to lower case, as shown in Table 3.1.

Table 3.1: Basic char formatting

RAW TEXT He wasn’t, they thought, after them.

FORMATTED TEXT he wasn ’ t , they thought , after them .

This insertion of the space character does not change the semantics of the text, while enabling a more correct counting of the true term occurrences. For example, in the text ‘John and Mary ate some food. John, who was hungry, ate too much.’, the word John would only be counted once, even though it should be captured twice, as in ‘John and Mary ate some food . John , who was hungry , ate too much .’
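A minimal sketch of this formatting step is given below. The regular expression is an assumption about what counts as a 'non-letter typing form': it isolates every character that is neither a word character (letter, digit or underscore) nor whitespace. The function name `format_text` is hypothetical.

```python
import re


def format_text(raw):
    """Isolate punctuation with blank spaces and convert the text to lower case."""
    # Surround every character that is neither a word character nor whitespace with spaces.
    spaced = re.sub(r"([^\w\s])", r" \1 ", raw)
    # Lower-case the text and collapse the runs of blanks created above.
    return " ".join(spaced.lower().split())


formatted = format_text("He wasn't, they thought, after them.")
john_count = format_text(
    "John and Mary ate some food. John, who was hungry, ate too much."
).split().count("john")
```

After formatting, splitting on blank spaces yields one token per word or punctuation mark, so both occurrences of 'john' are counted.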

Now that all words are delimited by blank spaces, the program goes through the document and saves all 1-grams in one dictionary and all n-grams (n going from 2 to 7) in another, having the term/expression as the key and its absolute frequency as one of the values. The one-word and multiword dictionaries are henceforth referred to as D1 and D2, respectively.
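The construction of the two dictionaries can be sketched as follows. This simplified version stores only the absolute frequency of each key, whereas D2 as used in this work keeps additional values per n-gram; the function name `build_dictionaries` and the sample tokens are assumptions made for illustration.

```python
from collections import Counter


def build_dictionaries(tokens, max_n=7):
    """Return (d1, d2): frequencies of 1-grams and of n-grams with 2 <= n <= max_n."""
    d1 = Counter(tokens)
    d2 = Counter()
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            # The key is the n-gram written as a blank-separated string.
            d2[" ".join(tokens[i:i + n])] += 1
    return d1, d2


tokens = "john and mary ate . john and mary slept .".split()
d1, d2 = build_dictionaries(tokens)
```

In this toy text `d1["john"]` and `d2["john and mary"]` both hold a frequency of 2, and no key of `d2` is longer than seven words.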

Then the cohesion of all n-grams present in D2 is calculated using the SCP, Dice, MI and φ² cohesion metrics. Notice that the algorithm works with just one of these metrics at a time, although the results obtained with the different metrics will be used for evaluation. This calculation is followed by the computation of the Ω_n−1 and Ω_n+1 values for every n-gram, as explained in Subsection 2.3.11.

D2 is a more complex structure. For each n-gram, it holds two lists of values that describe it: the first keeps its absolute frequency, its cohesion value and the maximum cohesion value of all the corresponding (n+1)-grams; the second has two values, which keep count of how many (n+1)-grams the expression in question has at each of its extremities.

In order for an expression to be a candidate, these values from the second list must be greater than 1. This criterion improves the Precision of the extraction.

In order to eliminate typographical errors, we opted to disregard expressions that present an absolute frequency of 1. This does not seem to delete expressions that are relevant in the document, as these, generally, appear more than once. In fact, this criterion proved to be more beneficial for the Precision measure than detrimental for the Recall.


3.1.1.1 Forbidden characters

Before the next step, a simple filtering must take place to eliminate obviously unwelcome expressions. After extracting the candidates, some expressions may still contain forbidden characters, such as commas, parentheses and hyphens, to name a few. In this filtering step, the algorithm runs over all extracted expressions and eliminates every one containing these characters, as no expression populated by them is relevant.
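This filter can be sketched in a couple of lines. The exact set of forbidden characters is an assumption: the text names only commas, parentheses and hyphens, and the remaining characters in `FORBIDDEN` were added for illustration. The function name `drop_forbidden` is hypothetical.

```python
# Assumed set of forbidden characters; the text names commas, parentheses and hyphens only.
FORBIDDEN = set(",.;:!?()[]{}-\"'")


def drop_forbidden(expressions):
    """Keep only the expressions that contain no forbidden character."""
    return [e for e in expressions if not any(ch in FORBIDDEN for ch in e)]


candidates = ["swimming pool", "pool , and", "state-of-the-art", "united states"]
kept = drop_forbidden(candidates)
```

Only 'swimming pool' and 'united states' survive in this example, since the other two candidates contain a comma and hyphens, respectively.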

3.1.2 Automatic identification of Stop-words – The Stop-word List

This step consists of trimming the list retrieved in the previous phase. After the extraction, the list of MWEs is still filled with unimportant and meaningless words. These candidates are to be pre-processed, looking for those that do not respect the imposed conditions, which will be explained below, and discarding them. In order to achieve this, we developed a technique that enables our program to figure out the most prevalent function words, without having to resort to a dictionary or part-of-speech tags, or even to know which language we are dealing with. We called this technique "Context Analysis".

3.1.2.1 Word context analysis – Finding the threshold

Every language needs functional words to connect and articulate its writing and speech.

Articles, prepositions and some adverbs are an example of such words, which tend to bear very little meaning, and are irrelevant for short summaries and conveying ideas using few words.

As mentioned before, in [3], the author attempts to gather such words using absolute frequency, claiming that the most common words in a text are of little significance. Intuitively, this seems to be rather accurate, as empty words, such as ‘the’, ‘in’ or ‘a’, appear very frequently in documents.

However, obvious mistakes may come from this approach, as important words can easily be captured by this list, and the author has to specify how many of those frequent words he wants to gather. He always captures 201, whatever the corpus size, which is a risky commitment. In fact, the number of words captured should vary according to the corpus size.

For example, in a domain-specific document, it is only natural that important domain related terms emerge repeatedly, and to base their exclusion solely on the frequency of their occurrence seems rather rudimentary and error prone.

Instead, we propose a more elaborate mechanism to spot unimportant words, regardless of the language of the corpus. This approach tries to capture the very reason why these so-called ‘function words’ bear very little importance. They all have something in common: they seem to appear many times, scattered around the texts, and as they