Natural Language Processing - Diagnosis and Prognosis of Occupational disorders based on Machin

One of the interesting fields of AI, which was introduced in 1950, is NLP [111]. The way we communicate through language has changed radically over the past few years. With technological development, humans have adopted new instruments to communicate.

Communication between humans and computers has never been as broad or as global as it is now. How does a computer comprehend the structure, organization, and even meaning of what is written? In this regard, NLP is a computational field that was developed to serve as a bridge between natural language and computer language. It can be applied to all domains, from day-to-day tasks such as auto-completion or spam filtering to medical text analysis. The usage of NLP increased with developments by major industry players, such as Google, Facebook, Amazon and Netflix, as well as with the main strategies applied in the context of biomedical text mining. In general, NLP techniques include text generation, text classification, text processing, and text comprehension. First, a list of the main terms will be introduced to the reader.

2.6.1 Definitions

This section defines the terms needed before presenting related works and the methods used to perform them.

• Character: A character is the smallest textual element. Combined with other characters, it forms words.

• Word/Token: A word/token is an element of a sentence.

• Sentence: A sentence is a set of words or tokens organized sequentially.

• Document: A document is a piece of text with a collection of words or tokens that are used to build sentences. It can be made of only one sentence, or multiple sentences.

• Corpus: The corpus is a collection of text material (group of documents). It represents the highest level of textual information. This collection is typically annotated and used for machine learning tasks.

• Dictionary/Vocabulary: A set of all words that appear in at least one document of the corpus.

• Grammar: The set of rules that define a language, that is, how words are organized to build sentences and how words are inflected.

• Topic: A topic is a class/label that belongs to the document. A document can have several topics depending on the words it contains. A topic can also be defined as a set of words that co-occur.

Currently, ML strategies, both supervised (using sample input-output pairs to learn a function that maps an input to an output) and unsupervised (algorithms that find patterns in data sets with unclassified or unlabeled data points), dominate NLP applications. NLP can be used not only in medical databases but also in a broad set of applications such as machine translation, grammar checking, spam filtering, and sentiment analysis, among others.

Despite substantial progress in English-based NLP and its success in medicine [111] in recent years, the availability of models in other significant languages, such as Hindi [145], Arabic [63], Portuguese [51], or Spanish [147], is limited, not to mention the other approximately 6,500 languages spoken around the world. The situation is even more complex for domain-specific models (specialized for medical, technical, or legal jargon, or non-natural languages such as code), where only a few high-quality models exist for languages other than English. Many congresses and conferences have been convened to identify transformational opportunities associated with developing localized and/or specialized language models and to discuss the challenges of bringing this bleeding-edge technology to medical production, as well as the impact ML is having on medical decision-making.

For example, a hybrid approach combining rule-based and machine learning-based techniques was presented for unbalanced data on Adverse Drug Reactions (ADRs) in real Electronic Health Records (EHRs) from the Spanish population. In this highly skewed classification setting, the knowledge-based and inferred models complemented each other in terms of recall and precision, for both intra-sentence and inter-sentence ADRs [27]. DL has been applied to clinical notes in Spanish and Swedish for medical named entity recognition [166]. Other studies evaluate medical texts, such as [119], where DiZer produces rhetorical structures of scientific texts based on Rhetorical Structure Theory [127] for the Portuguese language. We will look at which fields benefit the most from Machine Learning-based Natural Language Processing (ML-NLP) in Section 3.4.

2.6.2 Theory, Methods and Tools of NLP

This section explains the methodologies used to accomplish NLP. It outlines and describes the methods used in each NLP phase to tackle a variety of problems in this domain, including pre-processing, feature extraction, and DL.

It is necessary to be aware of any existing ambiguities when analyzing a language. These do not always arise from the text itself; rather, for a specific analysis, conjugated words, as well as word and sentence connectors, are often unnecessary and can produce noise. For example, high-frequency words in the English language include "the," "and," "or," and other terms that are irrelevant to most analyses. Verb conjugations or gender inflections may also be unnecessary. Pre-processing the text entails taking into account all of these features of the language, and it is an extremely important stage. Cleaning the text, deleting unnecessary words and punctuation, and reducing the content to its simplest form so that key information can be extracted and processed appropriately are all part of the process. The most important text pre-processing methods are discussed in this section.

2.6.2.1 Tokenization

Tokenization is the process of breaking down text into tokens. These could be words, characters, or even subwords. For example, the phrase "I am supermotivated" can be tokenized in two ways: "[I am, supermotivated]" and "[I am, super, motivated]." This phase is critical for preparing the data for further processing.
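The three token granularities mentioned above can be illustrated with a minimal sketch. The function names and the tiny subword vocabulary are assumptions made for this illustration only; practical tokenizers handle punctuation, casing and unknown words far more carefully.

```python
import re


def word_tokens(text):
    """Split text into lowercase word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def char_tokens(text):
    """Split text into single-character tokens, ignoring whitespace."""
    return [ch for ch in text if not ch.isspace()]


def subword_tokens(word, vocabulary):
    """Greedy longest-match split of one word into known subwords.

    Characters not covered by the (hypothetical) vocabulary are emitted
    as single-character tokens.
    """
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocabulary or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    return pieces


words = word_tokens("I am supermotivated")
subwords = subword_tokens("supermotivated", {"super", "motivated"})
```

Here `words` holds the word-level tokens and `subwords` splits the last word further, mirroring the second tokenization in the example above.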

2.6.2.2 Stop Word Removal

Words that are not crucial for processing are referred to as stop words. In practice, these words are treated as noise and can affect the results. As a result, each language has a list of stop words that are removed from the text before the next processing steps. These are usually very common words that can confuse the system when comparing texts.
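As a minimal sketch of this step, the filter below removes tokens found in a stop word list. The list shown is a small illustrative assumption, not a complete or standard list for English.

```python
# Illustrative stop word list (an assumption for this sketch, not a standard list).
STOP_WORDS = {"the", "and", "or", "a", "an", "of", "to", "is", "in"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lowercase form is not in the stop word list."""
    return [token for token in tokens if token.lower() not in stop_words]


filtered = remove_stop_words(["The", "cat", "and", "the", "dog"])
```

The comparison is case-insensitive, so "The" and "the" are both dropped while the original casing of the remaining tokens is preserved.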

2.6.2.3 Lemmatisation/Stemming

As mentioned, languages use inflection, namely for gender and verb conjugation, which produces variations of a word and, in some cases with irregular conjugations, changes the root form of the word completely. Having the root form of a word is better for subsequent processing, since differently conjugated forms of a word will be treated as the same.

In the sentence "I like to have something that he likes", "like" and "likes" originate from the same root word "like" and should both be treated as "like" to improve the analysis of some methodologies. There are two ways of simplifying this text representation: lemmatisation and stemming.

Lemmatisation is the process of grouping together the inflected forms of a word under its lemma, that is, the dictionary form (e.g., "to walk", "walked" and "walking" have the same lemma "walk"). Stemming differs from lemmatisation in that the meaning is not inferred, as it is in the case of the lemma. A stem is the root form of a word, e.g. "world" is the stem of "worldwide" and "worlds".
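The difference between the two can be sketched with a toy suffix-stripping stemmer and a toy dictionary-based lemmatizer. Both the suffix list and the lemma dictionary are hypothetical and chosen only to reproduce the examples above; established stemmers and lemmatizers use much richer rules and lexicons.

```python
# Hypothetical suffix list, ordered so that longer suffixes are tried first.
SUFFIXES = ("wide", "ing", "ed", "es", "s")

# Hypothetical lemma dictionary; "went" is added to show an irregular form.
LEMMAS = {"walked": "walk", "walking": "walk", "likes": "like", "went": "go"}


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem:
            return lowered[: -len(suffix)]
    return lowered


def naive_lemma(word):
    """Look the word up in the lemma dictionary, falling back to the word itself."""
    lowered = word.lower()
    return LEMMAS.get(lowered, lowered)


stems = [naive_stem(w) for w in ("worlds", "worldwide", "likes")]
lemmas = [naive_lemma(w) for w in ("likes", "walking", "went")]
```

Note that the stemmer reduces "likes" to "lik", which is not a dictionary word, whereas the lemmatizer returns "like" and can also map the irregular form "went" to "go".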

2.6.2.4 Part of Speech (POS)

Text mining uses Part of Speech (POS) tagging as both a pre-processing tool and a task. Because it assigns tags to words and thereby makes the role of each word in a sentence explicit, this method is very useful for natural languages. Thanks to this tagging, features may be retrieved based on the class of tags given to words, and text can be parsed based on the tags rather than the words themselves (e.g., one might want to search for all verbs in a document).
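Searching by tag rather than by word can be sketched as follows. The sentence is tagged by hand for the purpose of the illustration, and the tag names are an assumption; in practice a trained tagger would produce the (word, tag) pairs.

```python
# Hand-tagged example sentence (an assumption; a trained tagger would produce this).
tagged_sentence = [
    ("The", "DET"),
    ("patient", "NOUN"),
    ("reported", "VERB"),
    ("severe", "ADJ"),
    ("pain", "NOUN"),
]


def words_with_tag(tagged_tokens, tag):
    """Return the words whose POS tag equals the requested tag."""
    return [word for word, word_tag in tagged_tokens if word_tag == tag]


verbs = words_with_tag(tagged_sentence, "VERB")
nouns = words_with_tag(tagged_sentence, "NOUN")
```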

2.6.2.5 N-grams

Text is a sequential structure in which the next word depends, to some extent, on the previous ones. As a result, having the text formatted in a way that exposes these dependencies is critical when developing probabilistic models. The N-gram structure was created specifically for this purpose. N-grams are a systematic way of arranging text by grouping N consecutive tokens with a sliding window, so that consecutive groups overlap in N-1 tokens. For example, the bigrams (2-grams) of the sentence "I am doing great" are ["I am", "am doing", "doing great"].
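The sliding-window construction can be sketched in a few lines. The function name and the choice of joining tokens with a space are assumptions of this illustration.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined by a space."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams("I am doing great".split(), 2)
trigrams = ngrams("I am doing great".split(), 3)
```

A sentence with fewer than n tokens yields an empty list, since no complete window fits.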

In addition to parsing text using regular expressions or rules, there are tools that combine both methods to search for text based on feature structure rules, such as pyRATA (Python Rule-based feAture sTructure Analysis). In this case the search has several levels, because regular expressions can be applied to the tags of words, and also to the structure of a text rather than the text itself.

2.6.2.6 Text Parsing

Parsing is the process of splitting text into smaller segments according to rules. These rules are defined using a specific grammar. Regular grammars, from which regular expressions can be defined, are an example of such a set of rules and provide a sophisticated method for parsing text patterns. Specific operators can be used to match text. Regular expressions also offer specific search constructs, such as "lookahead" and "lookbehind," that make it easier to find specific matches. There are also tools for text generation based on regular expressions that can be used for different purposes.
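The lookahead and lookbehind constructs can be illustrated with Python's standard `re` module. The clinical-looking sentence and the patterns are invented for this sketch.

```python
import re

# Invented example text for this sketch.
text = "Dose: 20 mg daily, weight 70 kg, pressure 120 mmHg"

# Lookahead (?=...): numbers that are immediately followed by " mg".
doses = re.findall(r"\d+(?= mg\b)", text)

# Lookbehind (?<=...): the number that comes right after "weight ".
weights = re.findall(r"(?<=weight )\d+", text)

# Negative lookahead (?!...): whole numbers that are not followed by " mg".
other_numbers = re.findall(r"\b\d+\b(?! mg\b)", text)
```

In all three cases the surrounding context constrains the match without being part of the returned text, which is what distinguishes these constructs from ordinary groups.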

2.6.2.7 Grammar Inference

In addition to text-parsing methods based on grammars, in which text is parsed according to a set of rules, there are grammar inference methods, which infer the rules, and thus the structure and language, of a piece of text. The Sequitur algorithm, which infers context-free grammars, and Regular Positive and Negative Inference (RPNI), which infers regular grammars, are two examples of such algorithms.

2.6.2.8 Feature Extraction

An information retrieval phase is often performed to process the textual information and provide answers for the problem at hand. The main goal of this step is to analyze the content of the corpus by converting textual data into quantitative data. Approaches include word frequency counts, n-gram frequency counts, and detection of word relationships using embeddings, among others. This section explains the majority of these feature extraction strategies.

2.6.2.9 Bag of Words

The Bag of Words (BoW) model counts the number of word/token occurrences in each document. The data is commonly kept in a matrix, with the rows representing the documents and the columns representing each unique word found. The model computes the term frequency, tf, from the raw counts, which is defined as:

$$\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} \tag{2.5}$$

where $f_{t,d}$ is the raw count of term $t$ in document $d$, and $t'$ ranges over the terms that belong to document $d$. In addition, instead of a single word, the term can be an N-gram, which provides more contextual information about words in a variety of problems. The features collected in this way are entirely statistical, although they can provide a useful measure of document differences.
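The count matrix and the term frequency of Equation 2.5 can be sketched as follows. The function names and the toy documents are assumptions of this illustration; documents are assumed to be already tokenized.

```python
from collections import Counter


def bow_matrix(documents):
    """Return the sorted vocabulary and one row of raw counts per document."""
    vocabulary = sorted({token for doc in documents for token in doc})
    rows = []
    for doc in documents:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


def term_frequencies(tokens):
    """Equation 2.5: raw count of each term divided by the total count in the document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


docs = [["pain", "back", "pain"], ["back", "neck"]]
vocabulary, counts_matrix = bow_matrix(docs)
tf_first = term_frequencies(docs[0])
```

Each row of `counts_matrix` corresponds to one document and each column to one vocabulary term, matching the matrix layout described above.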

2.6.2.10 Term Frequency/Inverse Document Frequency (TF-IDF)

One of the BoW model's acknowledged flaws is that it evaluates a word's significance based solely on its frequency, rather than considering that terms that appear in all documents may be less relevant. The Term Frequency/Inverse Document Frequency (TF-IDF) model, like the BoW model, raises the relevance of a word according to its raw occurrences, while decreasing its importance in proportion to the number of documents that contain that word. The model is defined as the product of the tf and the inverse document frequency (idf), which is calculated as follows:

$$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} \tag{2.6}$$

where $D$ is the set of documents and $N$ the total number of documents. The final equation of the TF-IDF model is the following:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D) \tag{2.7}$$

2.6.2.11 Words/Paragraph Embedding

For feature extraction, the use of BoW and TF-IDF models has two major drawbacks: (1) the matrices are extremely sparse; (2) the link between words is only provided by n-grams, which do not capture the entire relationship and increase the sparsity of the matrices as the order of the n-grams rises. Word embeddings, which are vector representations of words in a continuous N-dimensional space, were introduced to solve these problems: they provide a numerical representation of words that reduces the sparsity of statistical features while also capturing more meaningful relationships between words.
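To make the sparsity concrete, the sketch below builds a TF-IDF matrix following Equations 2.5 to 2.7 and measures the fraction of zero entries. The function names and the toy corpus are assumptions of this illustration, and the natural logarithm is assumed for the idf.

```python
import math
from collections import Counter


def tfidf_matrix(documents):
    """Return the sorted vocabulary and one row of TF-IDF weights per document."""
    n_docs = len(documents)
    vocabulary = sorted({token for doc in documents for token in doc})
    # Number of documents that contain each term (denominator of Equation 2.6).
    doc_freq = Counter(term for doc in documents for term in set(doc))
    matrix = []
    for doc in documents:
        counts = Counter(doc)
        total = sum(counts.values())
        row = []
        for term in vocabulary:
            tf = counts[term] / total if total else 0.0  # Equation 2.5
            idf = math.log(n_docs / doc_freq[term])      # Equation 2.6
            row.append(tf * idf)                         # Equation 2.7
        matrix.append(row)
    return vocabulary, matrix


def sparsity(matrix):
    """Fraction of entries in the matrix that are exactly zero."""
    cells = [value for row in matrix for value in row]
    return sum(1 for value in cells if value == 0) / len(cells)


corpus = [["pain", "in", "back"], ["pain", "in", "neck"], ["headache"]]
vocabulary, weights = tfidf_matrix(corpus)
zero_fraction = sparsity(weights)
```

Even in this three-document corpus more than half of the entries are zero, and the fraction grows quickly as the vocabulary grows. A term that occurs in every document receives an idf of zero and therefore a weight of zero.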

A word embedding yields a similarity measure based on context and target words, implying that the relationship between words entails more than just edit similarity, but also context association and meaning. In this sense, even though they have a large edit distance, words like "Queen" and "King" should have some similarity. The same is true for words like "cat" and "kitten," which should have similar vector representations depending on context. When it comes to learning word embeddings, there are two main approaches: (1) a count-based method based on the matrix factorization of a global word co-occurrence matrix, and (2) a context-based method based on a supervised process that learns the word embedding representation [168]. Context-based models are presented next.

Constructing context-based word embeddings requires training a neural network with a single hidden layer with the goal of maximizing the probability of the upcoming words given the previous words. This vector form of words allows word similarity to be examined through proximity in N-dimensional space. A simple neural network with a single hidden layer and a softmax output layer is shown in Figure 2.12.

The output is the probability of a word being a context word for a given target word, with the target word as the input. This model is called the Skip-gram model. The network can also be used in reverse, to predict the likelihood of a specific target word using a set of context words as inputs; this is the Continuous Bag of Words (CBOW) model. These models were defined by Mikolov et al. in 2013 [105]. Both models can be used to compute distances between word vectors. In practice, this typically relies on pre-trained models, such as Google's word2vec model, or on tools such as Gensim and spaCy that use these pre-trained models and can also be extended to customize models and train them on one's own text.
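The forward pass of both networks can be sketched in plain Python. This is only an illustration under stated assumptions: the matrices are tiny, hand-written and untrained, the function names are invented here, and training (updating the matrices from a corpus) is omitted entirely.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to one."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def output_probabilities(hidden, context_matrix):
    """Multiply the hidden vector (size N) by the N x V context matrix, then apply softmax."""
    vocab_size = len(context_matrix[0])
    scores = [
        sum(hidden[k] * context_matrix[k][j] for k in range(len(hidden)))
        for j in range(vocab_size)
    ]
    return softmax(scores)


def skipgram_forward(target_index, embedding_matrix, context_matrix):
    """A one-hot input times the V x N embedding matrix selects one row as the hidden layer."""
    hidden = embedding_matrix[target_index]
    return output_probabilities(hidden, context_matrix)


def cbow_forward(context_indices, embedding_matrix, context_matrix):
    """The hidden layer is the average of the embedding rows of all context words."""
    rows = [embedding_matrix[i] for i in context_indices]
    hidden = [sum(column) / len(rows) for column in zip(*rows)]
    return output_probabilities(hidden, context_matrix)


# Toy, untrained matrices with vocabulary size V = 3 and hidden size N = 2.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_context = [[0.0, 1.0, 2.0], [1.0, 0.0, 0.0]]

skipgram_probs = skipgram_forward(0, W, W_context)
cbow_probs = cbow_forward([0, 1], W, W_context)
```

After training, the rows of the embedding matrix are the word vectors used for similarity comparisons.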

Additionally, there is a model that combines the count-based and context-based approaches, developed in 2014 by Pennington et al. The model is called Global Vectors (GloVe) and uses a function that depends on ratios of co-occurrence probabilities [168, 120].
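Whichever model produces the vectors, proximity between words is commonly measured with cosine similarity. The sketch below assumes that measure; the three-dimensional vectors are invented for illustration and do not come from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented toy vectors (assumption); real embeddings have hundreds of dimensions.
toy_vectors = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "kitten": [0.10, 0.20, 0.95],
}

royal = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
unrelated = cosine_similarity(toy_vectors["king"], toy_vectors["kitten"])
```

With these toy values, `royal` is larger than `unrelated`, which is the behaviour expected of "King" and "Queen" despite their large edit distance.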

2.6.2.12 Deep Learning Approach - Transformers

The usage of deep learning with NLP will not be covered extensively. The main current applications of NLP with deep learning use Transformer-based models, notably the Bidirectional Encoder Representations from Transformers (BERT) model [39]. These have outperformed RNNs, the previously dominant models, in most NLP tasks (see Section 3.4).

The work, published by Google AI, shows that a language model trained in a bidirectional way can have a deeper sense of language context and flow than the single-direction approach of RNNs. The process involves using an attention model (Transformer) to learn the relationships between words in a text. Because the Transformer encoder is bidirectional, it can read the entire sequence of words at once, which provides greater contextualization of each word's surroundings. The model uses an encoding layer with a classification layer on top, as illustrated in Figure 2.13.

The input layer shows masked values, which are used as the prediction targets to train the model. The BERT process has two main steps: (a) pre-training and (b) fine-tuning. In (a), the model is trained on unlabelled data by means of two unsupervised tasks: (a.1) Masked Language Model and (a.2) Next Sentence Prediction (NSP). In (b), the model is trained on labelled data for the specific task at hand. The input data for the model is a combination of token, segment and position embeddings, as presented in Figure 2.14.

[Figure 2.12, panel (a): Word2vec based on Skip-gram; panel (b): Word2vec based on CBOW. Each panel shows a one-hot input layer of size V, a hidden layer of size N (for CBOW, the average of the vectors of all input words) and a softmax output layer, connected by the embedding matrix W and the context matrix W'.]

Figure 2.12: Skip-gram and CBOW models of the word2vec approach. Input and output words (w_i and w_j), vocabulary size V, word-embedding matrix W and output context matrix W'. Images taken from [168].

[Figure 2.13 labels: inputs W1 W2 W3 [MASK] W5; Embedding; Transformer encoder; Classification Layer: Fully-connected layer + Norm; Embedding to vocab + Softmax; outputs O1 to O5 and W'1 to W'5.]

Figure 2.13: Transformer Encoder Layer with masked embeddings [76].

[Figure 2.14 labels: input "[CLS] my dog is cute [SEP] he likes play ing [SEP]" with "dog" and "likes" replaced by [MASK]; token embeddings E_[CLS], E_my, E_[MASK], E_is, E_cute, E_[SEP], E_he, E_[MASK], E_play, E_ing, E_[SEP]; sentence embeddings E_A for the first sentence and E_B for the second; positional embeddings E_0 to E_10.]

Figure 2.14: BERT input representation [39].
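In BERT, the combination of the three embeddings is an element-wise sum, which can be sketched as follows. The lookup tables below are tiny integer placeholders invented for this illustration; in the actual model they are learned matrices with hundreds of dimensions.

```python
def bert_input_embeddings(token_ids, segment_ids, token_table, segment_table, position_table):
    """Sum the token, segment and position embedding of every input position."""
    combined = []
    for position, (token_id, segment_id) in enumerate(zip(token_ids, segment_ids)):
        vectors = (
            token_table[token_id],
            segment_table[segment_id],
            position_table[position],
        )
        combined.append([sum(components) for components in zip(*vectors)])
    return combined


# Placeholder tables (assumption): 2-dimensional vectors with easy-to-read values.
token_table = [[1, 0], [0, 1]]             # one row per vocabulary entry
segment_table = [[10, 10], [20, 20]]       # row 0: sentence A, row 1: sentence B
position_table = [[100, 100], [200, 200]]  # one row per position in the sequence

inputs = bert_input_embeddings([0, 1], [0, 1], token_table, segment_table, position_table)
```

Each position therefore carries, in a single vector, which token it is, which sentence it belongs to, and where it sits in the sequence.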

3 Literature Review

From the document Diagnosis and Prognosis of Occupational Disorders based on Machine Learning Techniques applied to Occupational Profiles (pages 46-55)