FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Automatically generated summaries of sports videos based on semantic content

Miguel André Almeida Tomás Ferreira de Barros

Mestrado Integrado em Engenharia Eletrotécnica e Computadores

Supervisor: Luís Filipe Teixeira
Second Supervisor: Pedro Santos


Abstract

With the exponential growth of information available online, it has become humanly impossible to understand or make decisions when faced with excessive amounts of information, a phenomenon known as Information Overload. In order to allow the extraction of knowledge from textual data, several approaches were developed, such as Information Retrieval (IR), Document Clustering, Information Extraction (IE) and Natural Language Processing (NLP), among others.

Social networks have changed the way content is created and consumed. Nowadays, users feel the need for instant access to content, and content creators had to readjust the way they create and distribute content in order to survive in the market. As an example, social networks are today the main source of online news. This also applies to sports content.

Sport has been a part of our lives since the beginning of time, whether as spectators or participants. The diffusion and proliferation of multimedia platforms made the consumption of this content easily and conveniently available to everyone. Sports videos appeal to a large population all around the world and have become an important form of multimedia content that is streamed over the Internet and television networks.

Moreover, sports content creators today use social networks to provide users with innovative content in the shortest time possible. This content can be, for example, live commentary of a sporting event or a game summary in the form of text or video. Currently, sports content is created manually by studio professionals, which is an obstacle to reaching users quickly. To overcome this, automatic content-creation tools are required.

MOG Technologies is a company that provides its customers with products tailored to their needs. The lack of a tool that takes advantage of all the available information to help creators develop sports content automatically, to be used on social networks, led to the proposal of this Dissertation. The main objective was to develop a proof-of-concept tool capable of summarizing football matches based on semantic content.

The commentator's speech was converted into text to extract events. Several machine learning models and a pre-trained deep learning model were then tested to classify sentences into important events. To train the models, a dataset was created, combining 48 transcripts of games from different television channels. In addition, phrases from 72 games provided by Google Search were extracted. The final dataset contained 3260 phrases. A Google API, called Google Cloud Speech-to-Text, was used to convert speech into text. After the detection of events, textual information present in the video was extracted to identify the player involved in the action. Finally, a summary was created for each correctly identified event. Each entry of the summary is composed of the event name, the time when the event occurred and the players' names.

The results obtained for the best model, 92% accuracy and an average F1 score of 0.88, show that the developed tool is capable of detecting events in football matches with a low error rate. Also, by combining multiple sources, it was possible to increase the performance of the model. It is important to note that the work developed will allow MOG Technologies to develop future projects based on the proposed solution.


Resumo

Com o crescimento exponencial da informação disponível online, tornou-se humanamente impossível perceber ou tomar decisões perante uma quantidade excessiva de informação, um fenómeno conhecido por Information Overload. A fim de extrair conhecimento a partir da informação textual, várias abordagens foram desenvolvidas, tais como: Information Retrieval (IR), Document Clustering, Information Extraction (IE), Natural Language Processing (NLP), entre outras.

As redes sociais mudaram a maneira como os conteúdos são criados e consumidos. Atualmente, os utilizadores sentem a necessidade de ter acesso instantâneo aos conteúdos, e os criadores de conteúdo tiveram de mudar a maneira como criam e distribuem conteúdo para sobreviver no mercado. Por exemplo, as redes sociais são, hoje em dia, a principal fonte de informação online. Isso também se aplica ao conteúdo desportivo.

O desporto faz parte das nossas vidas desde o começo dos tempos, seja como espectadores ou participantes. A difusão e proliferação de plataformas multimédia tornaram o consumo destes conteúdos fácil e conveniente para toda a gente. Os vídeos de desporto atraem uma grande população por todo o mundo e tornaram-se uma importante forma de conteúdo multimédia distribuído pela Internet e pelas redes de televisão.

Além disso, os criadores de conteúdo usam as redes sociais para fornecer aos utilizadores conteúdos inovadores no menor tempo possível. Estes conteúdos podem ser, por exemplo, comentários ao vivo de um evento desportivo ou um resumo de um jogo na forma de texto ou vídeo. Atualmente, os conteúdos desportivos são criados manualmente por profissionais de estúdio, o que é um obstáculo para alcançar os utilizadores em pouco tempo. Para ultrapassar isso, são necessárias ferramentas automáticas para a criação de conteúdo.

A MOG Technologies é uma empresa que fornece aos seus clientes produtos adequados às suas necessidades. A falta de uma ferramenta que tire proveito de toda a informação disponível para ajudar os criadores a desenvolverem conteúdos desportivos automaticamente, para serem utilizados nas redes sociais, levou à proposta desta Dissertação. O objetivo principal era desenvolver uma ferramenta de prova de conceito capaz de resumir jogos de futebol com base em conteúdo semântico.

O discurso do comentador foi convertido para texto para extrair eventos. Vários modelos de machine learning e um modelo pré-treinado de deep learning foram testados para classificar as frases em eventos importantes. Para treinar o modelo, foi criado um dataset combinando 48 transcrições de jogos de diferentes canais de televisão. Além disso, foram extraídas frases de 72 jogos fornecidos pela Google Search. O dataset final continha 3260 frases. Uma API da Google, denominada Google Cloud Speech-to-Text, foi usada para a conversão da fala em texto. Após a deteção dos eventos, informações textuais presentes no vídeo foram extraídas para identificar o jogador envolvido na ação. Finalmente, foi criado um resumo para cada evento detetado corretamente. Cada entrada do resumo é composta pelo nome do evento, pelo minuto em que ocorreu o evento e pelos nomes dos jogadores.


Os resultados obtidos para o melhor modelo, 92% de precisão e um F1 score médio de 0.88, mostram que a ferramenta desenvolvida é capaz de detetar eventos em jogos de futebol com uma taxa de erro baixa. Além disso, combinando várias fontes, foi possível aumentar o desempenho da ferramenta. É importante realçar que o trabalho desenvolvido irá permitir à MOG Technologies desenvolver projetos futuros baseados na solução proposta.


Acknowledgements

First, I would like to express my deepest gratitude to my supervisor Professor Luís Filipe Teixeira for all the valuable knowledge transmitted and support given to me during this Dissertation.

To Pedro Santos and all my co-workers at MOG Technologies, for having provided additional support during this semester.

To my friends and family: without them, it would not have been possible to complete this journey successfully.

Finally, to my partner, Mariana Cruz, for all the support and friendship throughout the years, especially in the toughest times. To her, a huge thank you.

Miguel André Almeida Tomás Ferreira de Barros


“Curious that we spend more time congratulating people who have succeeded than encouraging people who have not.”

Neil deGrasse Tyson


Contents

1 Introduction 1

1.1 Context and Motivation . . . 1

1.2 Objectives . . . 2

1.3 Contributions . . . 2

1.4 Document Structure . . . 2

2 State of the Art 3

2.1 Speech-to-Text . . . 3

2.1.1 Acoustic Model . . . 4

2.1.2 Language Model . . . 4

2.1.3 Speech Recognition Tools . . . 5

2.2 Text Mining . . . 6

2.3 Feature Extraction . . . 8

2.3.1 Bag-of-Words . . . 8

2.3.2 Term Frequency-Inverse Document Frequency . . . 8

2.3.3 Word Embeddings . . . 9

2.4 Machine Learning Models . . . 10

2.4.1 Supervised Models . . . 10

2.4.2 Unsupervised Models . . . 14

2.5 Deep Learning . . . 19

2.5.1 Artificial Neural Networks . . . 19

2.5.2 Loss Function . . . 22

2.5.3 Deep Networks Architectures . . . 23

2.6 Discussion . . . 28

3 Proposed Solution 31

3.1 Video Collection . . . 32

3.2 Audio and Video Separation . . . 33

3.3 Transcription Creation . . . 33

3.4 Dataset Creation . . . 34

3.5 Event Detection . . . 36

3.5.1 Machine Learning Models . . . 36

3.5.2 Transfer Learning . . . 37

3.5.3 Optical Character Recognition . . . 40

3.5.4 Event Report . . . 41

3.6 Entity Extraction . . . 42

3.6.1 Transcription Extraction . . . 42

3.6.2 Image Text Extraction . . . 42


3.6.3 Summary Creation . . . 43

4 Results 45

4.1 Transcription Creation . . . 45

4.2 Dataset Creation . . . 46

4.3 Machine Learning Models . . . 47

4.3.1 Unsupervised Models . . . 47

4.3.2 Supervised Models . . . 47

4.4 Optical Character Recognition . . . 55

4.5 Entity Extraction and Summary Creation . . . 55

4.6 Results Analysis . . . 56

5 Conclusion and Future Work 59

5.1 Conclusion . . . 59

5.2 Future Work . . . 60

A Tools Used 61

B Entity Extraction - Video Examples 63

C Model Hyperparameters 65

C.1 Transcript Dataset . . . 65

C.2 Google Dataset . . . 66

C.3 Combined Dataset . . . 66

C.4 Combined Dataset without Goals . . . 67


List of Figures

2.1 Basic ASR structure [9]. . . 4

2.2 Text Mining Process [19]. . . 7

2.3 Word2Vec learning models, CBOW and Skip-Gram [23]. . . 9

2.4 Linear separation example using Support Vector Machine model [28]. . . 12

2.5 Polynomial Kernel transformation example to data points [28]. . . 12

2.6 RBF Kernel transformation example to data points [28]. . . 13

2.7 Similarity matrix of webcast text by (a) pLSA and (b) LSA [31]. . . 17

2.8 Latent Dirichlet allocation model [36]. . . 17

2.9 Lda2Vec model [33]. . . 18

2.10 Structure of the hierarchical system search model proposed by [35]. . . 19

2.11 Representation of the biological neuron (a) and the artificial neuron (b) [41]. . . . 20

2.12 Representation of an artificial neuron [40]. . . 20

2.13 Graphical representation of different activation functions [40]. . . 22

2.14 Representation of a fully connected multilayer feed-forward neural network topology [40]. . . 23

2.15 Convolutional operation [40]. . . 24

2.16 Example of a CNN architecture for sentence classification [45]. . . 25

2.17 LSTM block diagram [40]. . . 26

2.18 Three ways in which transfer might improve learning [47]. . . 27

2.19 Transfer learning representation based on feature-extraction [48]. . . 28

3.1 Complete system (long-term goal) . . . 31

3.2 Proposed Solution Structure. . . 32

3.3 Google Speech-to-Text tool error reduction [51]. . . 33

3.4 Dataset Creation Process. . . 34

3.5 Example of a soccer game summary provided by Google Search. . . 35

3.6 Transformer representation [55]. . . 38

3.7 Masked Language Model representation [55]. . . 39

3.8 BERT input representation [55]. . . 40

3.9 Google Cloud Vision text extraction demonstration. . . 41

3.10 An example of an event report. . . 41

3.11 An example of textual information present after the occurrence of an event. . . . 42

4.1 Example of transcription results with and without provided hints. . . 45

4.2 Example of the first two minutes of the created transcript for the game Dortmund vs Leverkusen. . . 46

4.3 Results obtained for the traditional machine learning models tested across the datasets. . . 49


4.4 Confusion Matrix for the Logistic Regression model. . . 49

4.5 Results obtained for the traditional machine learning models tested for the Combined Dataset without the Goal class. . . 51

4.6 Results obtained for the BERT model tested for the different datasets across different epochs. . . 53

4.7 Results obtained for the BERT model tested for the Combined dataset without the Goal class. . . 54

4.8 Loss evaluation over 100 epochs. . . 55

4.9 Example of a created summary for the Arsenal vs Manchester United game. . . 56

B.1 Two examples of textual information present after the occurrence of an event. . . 63


List of Tables

2.1 Comparison of the main Speech Recognition Tools currently available [16]. . . . 5

2.2 Comparison between the main Speech Recognition Tools when noise is introduced [17]. . . 6

2.3 Keywords example [30]. . . 15

3.1 Distribution of games per league/competition in the videos collected. . . 33

3.2 Distribution of games per league/competition extracted from Google Search. . . . 35

3.3 Model hyperparameters applied in the grid search. . . 36

3.4 BERT Model Parameters Tested. . . 39

4.1 Distribution of the classes over the three datasets. . . 47

4.2 Comparison between the results obtained from different traditional models for the Transcript Dataset. . . 48

4.3 Comparison between the results obtained from different traditional models for the Google Dataset. . . 48

4.4 Comparison between the results obtained from different traditional models for the Combined Dataset. . . 48

4.5 Comparison between the results obtained from different traditional models for the Combined Dataset when the Goal class was removed. . . 50

4.6 Comparison between the results obtained from different traditional models for the Combined Dataset when the Goal class was replaced. . . 50

4.7 Model hyperparameters that yield the best results for every dataset tested. . . 51

4.8 BERT results for the Transcript Dataset across different epochs. . . 52

4.9 BERT results for the Google Dataset across different epochs. . . 52

4.10 BERT results for the Combined Dataset across different epochs. . . 52

4.11 BERT results for the Combined Dataset, with every Goal sentence removed, across different epochs. . . 53

4.12 BERT results for the Combined Dataset, with every Goal sentence replaced by Irrelevant, across different epochs. . . 54

4.13 Results obtained by combining the players’ names extracted from both sources. . 55

C.1 Best hyperparameters obtained for the Transcript Dataset. . . 65

C.2 Best hyperparameters obtained for the Google Dataset. . . 66

C.3 Best hyperparameters obtained for the Combined Dataset. . . 66

C.4 Best hyperparameters obtained for the Combined Dataset, when the Goal sentences were removed. . . 67

C.5 Best hyperparameters obtained for the Combined Dataset, when the Goal sentences were replaced with Irrelevant. . . 67


Acronyms

AM Acoustic Model

ANN Artificial Neural Network
ASR Automatic Speech Recognition
CBOW Continuous Bag-of-Words

CLI Command Line Interface

CNN Convolutional Neural Network

DNN Deep Neural Network

DNN-HMM Deep Neural Network Hidden Markov Models

EM Expectation-Maximisation

GPU Graphics Processing Unit

HMM Hidden Markov Model

IDC International Data Corporation

IE Information Extraction

IR Information Retrieval

LM Language Model

LDA Latent Dirichlet Allocation
LSA Latent Semantic Analysis

LSTM Long Short-Term Memory

MAP Maximum A Posteriori

MLM Masked Language Model

NLP Natural Language Processing

NN Neural Network

NSP Next Sentence Prediction
OCR Optical Character Recognition

pLSA Probabilistic Latent Semantic Analysis

RBF Radial Basis Function

ReLU Rectified Linear Unit
RNN Recurrent Neural Network

STT Speech-To-Text

SVD Singular Value Decomposition

SWO Simple Word Overlap

TPU Tensor Processing Unit

WAcc Word Accuracy

WER Word Error Rate

ZB ZettaByte


Chapter 1

Introduction

1.1 Context and Motivation

In recent years, the amount of data available on the internet has grown exponentially. According to a report from the International Data Corporation (IDC) published in 2011, the data volume in the world was 1.8 Zettabytes (ZB), the equivalent of approximately 1.8 × 10²¹ bytes [1]. IDC forecasts that by the year 2025 this value will grow to 175 ZB [2]. Taking this into account, it is humanly impossible to understand or make decisions when faced with excessive amounts of information, a phenomenon known as Information Overload [3]. In order to allow the extraction of knowledge from textual data, several approaches were developed, such as Information Retrieval (IR), Document Clustering, Information Extraction (IE) and Natural Language Processing (NLP), among others.

Social networks have taken content creation and consumption by storm. Nowadays, users have the need for instant access to content, and content creators had to readjust the way they create and distribute content to survive in the market. As an example, social networks are today the main source of online news [4]. This also applies to sports content.

Sport has been a part of our lives since the beginning of time, whether as spectators or participants. The diffusion and proliferation of multimedia platforms made the consumption of this content easily and conveniently available to everyone. Sports videos appeal to a large population all around the world and have become an important form of multimedia content that is streamed over the Internet and television networks [5].

Moreover, sports content creators use social networks to provide users with innovative content in the shortest time possible. This content can be, for example, live commentary of a sporting event or a game summary in the form of text or video. Currently, sports content is created manually by studio professionals [6], which is an obstacle to reaching users in a short time. To overcome this, automatic content-creation tools are required.

MOG Technologies is a company that provides its customers with products tailored to their needs. The lack of a tool that takes advantage of all the available information to help creators develop sports content automatically, to be used on social networks, led to the proposal of this Dissertation. The company wants to develop a proof-of-concept tool capable of summarizing football matches based on semantic content.

1.2 Objectives

As presented above, there is a need for a tool capable of creating automatic summaries for football. The main objective of this work is to create a tool capable of extracting important events, as well as the players involved, based on semantic content, and of creating a text summary. The commentator's speech will be converted into text to extract events. Also, several machine learning models will be tested in order to classify each sentence in the text. After the extraction of both sentences and entities, a summary will be created. The summary must be in text format, and each entry should contain the event name, the time when the event occurred and the players involved. Since it is important to have a functional system, existing solutions will be used for some tasks instead of implementing and improving the algorithms. The models' accuracy will be analysed to validate the results obtained.

1.3 Contributions

The main contributions of this Dissertation for MOG Technologies include:

• A solution capable of creating automatic summaries for football matches based on semantic content. This tool will be adapted and implemented in future projects of the company. The project where this solution will be inserted cannot be detailed at this moment since it is confidential.

• A dataset that can be applied to sports classification problems and used in future projects as a starting point, skipping the need to spend time on data collection.

1.4 Document Structure

The remainder of this document is organized as follows:

• Chapter 2 [State of the Art] - Review of existing solutions, speech-to-text techniques and ASR API providers, as well as text mining models.

• Chapter 3 [Proposed Solution] - Detailed presentation of the proposed solution developed to achieve the objectives of this Dissertation.

• Chapter 4 [Results] - Results obtained during the development of the project.

• Chapter 5 [Conclusion and Future Work] - Conclusion of the Dissertation regarding the work developed, and future work in line with this Dissertation.


Chapter 2

State of the Art

This chapter presents a detailed review of the state of the art, focusing on the essential topics related to this Dissertation. Section 2.1 describes the theoretical aspects of speech-to-text (STT) conversion, and the available tools are presented in Subsection 2.1.3. Section 2.2 covers the most common techniques used for analysing text.

2.1 Speech-to-Text

Speech recognition is the process of converting an input speech signal into text. In a computer-aided speech recognition system, an acoustic waveform is converted into text containing the meaning of what has been spoken [7]. Speech recognition techniques can be classified into the following groups, depending on the input patterns:

• Speaker Dependent or Independent. The main difference is that a speaker-dependent recognition system uses a particular speaker in the training phase, while a speaker-independent system can identify any voice, not just a specific person.

• Isolated vs Continuous Speech Recognition. In the first case, the speech has precise pauses between words, while in the continuous case there is no clear pause between words.

• Keyword-based vs Sub-word Based. Keyword-based systems are trained to identify complete words. Sub-word based systems can recognise syllables or phonemes by training recognisers that are categorised as sub-word dependent recognisers.

In recent years, systems that provide STT conversion use an important technology called Automatic Speech Recognition (ASR). In Fig. 2.1, the structure of an ASR system is illustrated. Traditional ASR systems have four main components: signal processing and feature extraction, acoustic model (AM), language model (LM), and hypothesis search [8]. The first stage, signal processing and feature extraction, takes an audio signal as input; the speech is enhanced by removing noise and channel distortions, the signal is converted from the time domain to the frequency domain, and feature vectors are extracted for the following acoustic models. The acoustic model takes as input the feature vectors generated by the previous stage and creates an AM score for the variable-length feature sequence. The language model estimates the probability of a word sequence by understanding the correlation between words belonging to a training corpus. Lastly, the hypothesis search component combines the AM and LM scores and outputs the word sequence with the highest score as the result [9]. The most important stages present in an ASR system are detailed in the next sections.

Figure 2.1: Basic ASR structure [9].

2.1.1 Acoustic Model

The acoustic model has to deal with different issues, such as the variable-length feature vectors and the variability in the audio signals [9]. The Hidden Markov Model (HMM) is one of the most common algorithms used to create statistical acoustic models. HMMs address the problem of variable-length feature vectors; another alternative is dynamic time warping [8, 9].

To improve ASR accuracy, the combination of discriminative models such as deep neural networks (DNNs) with HMMs (DNN-HMM) has been adopted in the past years. The feasibility of these hybrid models was made possible by improvements in computational power, the availability of large training sets and a better understanding of these models. A successful speech recognition system must be able to overcome some difficulties regarding audio variability; examples of these variables are the speaker characteristics, speech behaviour and rate, and environment noise, among others [9].

2.1.2 Language Model

The language model uses n-gram¹ models and can be adapted to a specific style of data by using interpolation. The assumption is that each corpus² has unique characteristics captured in the n-gram counts [12]. The LM specifies which words are valid in the chosen language and in what sequences they can occur. The training process is usually based on millions of word tokens and on reducing perplexity on the training data. Perplexity is a measurement used to evaluate the performance of the model; lower is better. However, reduced perplexity does not necessarily mean better results. Traditional LMs are bigram (pair of words) and trigram (group of three words) models containing the computed probabilities of grouping two or three particular words in a sequence, accordingly [8].

¹An n-gram is a contiguous sequence of n words [10].

²A corpus is a large set of structured texts. In NLP, datasets are called corpora and a single set of data is called a corpus.

2.1.3 Speech Recognition Tools

After understanding the main structure of an ASR system, it is essential to investigate the different tools available that provide ASR systems. To evaluate a system's performance, it is necessary to define metrics. The most common metric is the word error rate (WER) and its complement, word accuracy (WAcc); the formulas are presented in Eqs. (2.1) and (2.2) [13]. The WER is calculated based on the number of substitutions (S), insertions (I) and deletions (D) with respect to the number of words N in the reference transcript.

$WER = \frac{I + D + S}{N}$ (2.1)

$WAcc = 1 - WER$ (2.2)

Nowadays, a wide variety of systems implement speech-to-text conversion, such as AT&T Watson, Microsoft Speech Server, the Google Speech API, Nuance Recognizer, Sphinx and also a solution from the Portuguese company Voice Interaction [14, 15].

In [16], a comparison between the Microsoft API, the Google API and Sphinx-4 is performed by calculating the WER discussed before. The test evaluated the capacity of these tools to recognise a list of sentences; the results can be seen in Table 2.1. After evaluating the different tools, the researchers concluded that the Google API has better acoustic and language models, probably due to the massive amount of data that Google uses to train its models [16].

Table 2.1: Comparison of the main Speech Recognition Tools currently available [16].

ASR WER WAcc

Sphinx-4 0.37 0.63

Google API 0.09 0.91

Microsoft API 0.18 0.82

Another aspect to consider in ASR systems is their performance when ambient noise is present and when there is no clear separation between words. In [17], the authors introduced ambient noise and a speaker gender factor to measure word accuracy (WAcc) and simple word overlap (SWO)³. Apart from the tools present in Table 2.1, Blanchard et al. also compared Bing Speech, AT&T Watson and Sphinx WSJ. The results are displayed in Table 2.2, and it is possible to see that the Google API and Bing Speech were the most accurate, with 0.56 and 0.51, respectively. The runners-up were AT&T Watson with 0.41, 0.08 for the Microsoft API, 0.14 for Sphinx with the HUB4 model and 0.00 for Sphinx with the WSJ model [17].

Further analysis revealed that both Google and Bing were mostly unaffected by speaker gender, speech class sessions and speech characteristics. Also, the results show that Google performed better than Bing when word order is considered (WAcc metric), but Bing had better accuracy when word order was ignored (SWO metric).

³"SWO is the number of words that appear in both the computer-recognized speech and the human-recognized speech" [17].

Table 2.2: Comparison between the main Speech Recognition Tools when noise is introduced [17]. Standard deviations are shown in parentheses.

ASR WAcc SWO

Google Speech 0.56 (0.35) 0.60 (0.31)
Bing Speech 0.52 (0.41) 0.62 (0.31)
AT&T Watson 0.41 (0.48) 0.53 (0.31)
Sphinx 4 0.14 (0.61) 0.32 (0.30)
Microsoft API 0.08 (0.70) 0.33 (0.31)
Sphinx WSJ 0.00 (0.67) 0.27 (0.27)

For this Dissertation, and considering the results described above, the Google API is the best system for converting speech to text: it is the one with the best results and, since it also has the best API documentation, the choice is simplified.
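For reference, a rough sketch of how a transcription request can be issued with the google-cloud-speech Python client is shown below. The file name, sample rate, language code and phrase hints are placeholders assumed for this sketch, the exact client API may vary between library versions, and this is not the configuration used in this work.

from google.cloud import speech

# The client assumes GOOGLE_APPLICATION_CREDENTIALS is configured in the environment.
client = speech.SpeechClient()

with open("commentary_sample.wav", "rb") as audio_file:  # placeholder file name
    audio = speech.RecognitionAudio(content=audio_file.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,          # placeholder sample rate
    language_code="en-GB",
    # Phrase hints can bias recognition towards domain vocabulary (e.g. player names).
    speech_contexts=[speech.SpeechContext(phrases=["penalty", "offside", "corner kick"])],
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)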

2.2 Text Mining

Text mining is the area of computer science research that tries to extract information from the vast amount of data available nowadays by combining techniques from data mining, machine learning, natural language processing, information retrieval and knowledge management [18].

The main difficulties when dealing with automated text comprehension are caused by the fact that the human/natural language [19]:

• Has ambiguous terms and phrases.

• Usually relies on the context to define and bring meaning to sentences.

• Is heavily based on commonsense understanding and reasoning.

• Influences the interaction between people.

Text mining techniques are applied in industry, academia, web applications and the internet, among others. The process involved in the conversion of unstructured data into structured data for extracting information is described in Fig. 2.2. The main phases of text mining are Data Collection, Data Pre-Processing, Data Transformation, Data Analysis and Result Evaluation [19].

• Data Collection: The primary purpose of text mining is extracting information from data, so the first step is to assemble data from diversified sources; this data can be in the form of reports, blogs, reviews, news, etc.


Figure 2.2: Text Mining Process [19].

• Data Pre-Processing: After the collection of data, it is important to remove redundant information and inconsistencies, separate words and convert the words into their base form, a process usually called stemming. Also, many unnecessary words are present in the majority of data, e.g. a, an, the, but. These words are denominated stop words and are removed in this phase. This phase helps the data analysis process and speeds up the overall performance of the system.

• Data Transformation: The conversion of a text document into the bag-of-words or vector space document model. In feature extraction, the words carrying the appropriate meaning are extracted from the document; in feature selection, the relevant words are chosen.

• Data Analysis: In this phase, the data is processed using different text mining methods, for instance, information retrieval, categorisation, classification and summarization. The information is used to extract valuable and relevant knowledge for effective decision making and trend analyses.

• Evaluation: In the last phase, it is necessary to evaluate the performance of the method by calculating precision, recall and accuracy.

The next sections focus on the steps that require a deeper look to better understand text mining. Feature extraction is described in Section 2.3. The machine learning techniques used for text classification are discussed in the subsequent subsections: the traditional machine learning models and also a subset of machine learning called deep learning; transfer learning is discussed as well. In terms of the learning process, the algorithms can be divided into supervised and unsupervised. Unsupervised models do not need a labelled dataset to be capable of categorising data.

2.3 Feature Extraction

Feature extraction is the process of converting textual data into real-valued vectors. The most popular models used to extract features from text are:

• Bag-of-Words (BOW).

• Term Frequency–Inverse Document Frequency (TF-IDF).

• Word Embeddings.

2.3.1 Bag-of-Words

The bag-of-words model is a popular representation method for feature extraction. The algorithm counts the frequency of each word in a document. A simpler explanation of the model is to consider the text as a group of keywords and extract information by finding the selected keywords. The exact order of the terms is ignored, and only the number of occurrences of each term in the dictionary is retained, so two sentences with the same terms in a different order are viewed as identical [20].

2.3.2 Term Frequency-Inverse Document Frequency

The term frequency-inverse document frequency transformation assigns a weight to a term t in document d, given by Eq. (2.3). The term frequency measures how frequently a term occurs in a document. Because documents have different lengths, a longer document is more likely to contain more occurrences of the term t; to prevent this, the number of occurrences is divided by the document length. The inverse document frequency measures how important a term is for the document: by computing it, rare terms receive a higher weight [20]. A short code sketch of this weighting follows the list below.

$w_{i,j} = \underbrace{tf_{i,j}}_{\text{Term Frequency}} \times \underbrace{\log \frac{N}{df_j}}_{\text{Inverse Document Frequency}}$ (2.3)

The value of the tf-idf equation is [20]:

• Highest when t occurs many times in a small number of documents.
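The sketch below applies the weighting of Eq. (2.3) with scikit-learn's TfidfVectorizer on a few toy commentary sentences; the sentences and vectorizer options are illustrative assumptions, not data or settings from this work.

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy "documents": short commentary sentences standing in for real transcripts.
docs = [
    "great goal scored by the striker",
    "the keeper makes a great save",
    "yellow card for a late foul",
]

# TfidfVectorizer builds the vocabulary, counts term frequencies and
# applies the inverse-document-frequency weighting of Eq. (2.3).
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf_matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # learned vocabulary
print(tfidf_matrix.toarray().round(2))      # one weighted vector per document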


2.3.3 Word Embeddings

Word embeddings are used in the preprocessing step to represent a document as real-valued vectors in a predefined space. Each word is represented by one vector, and the vector values are learned in a way that resembles a neural network. Word2Vec, GloVe and fastText are word embedding algorithms used to represent documents in an n-dimensional space [21].

Word2Vec is a statistical method for learning word embeddings from a text corpus. It was introduced by Tomas Mikolov et al. [22]. Word2Vec proposed two different learning models to learn the word embeddings:

• Continuous Bag-of-Words, or CBOW model.

• Continuous Skip-Gram Model.

The CBOW model forecasts the current word based on its context. The Skip-Gram model predicts the surrounding words given the current word [23]. The representation of both models is presented in Fig. 2.3.

Figure 2.3: Word2Vec learning models, CBOW and Skip-Gram [23].

The Global Vectors for Word Representation, also known as GloVe, is an expansion of the Word2Vec method for efficiently creating word vectors. GloVe constructs an explicit word-context matrix using statistics over the entire text corpus. The result is a learning model that might produce better word embeddings [21]. Word embeddings allow for a better representation of words and are usually used in the first layers of deep learning algorithms, or together with traditional algorithms as part of the preprocessing step, as in the Lda2Vec model discussed in Section 2.4.2.
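As a brief sketch of how such embeddings can be trained in practice, gensim's Word2Vec implements both CBOW and Skip-Gram; the toy corpus and hyperparameters below are assumptions chosen for illustration only, not the settings of the cited works.

from gensim.models import Word2Vec

# Tokenised toy corpus; a real corpus would contain millions of sentences.
sentences = [
    ["great", "goal", "scored", "by", "the", "striker"],
    ["the", "keeper", "makes", "a", "great", "save"],
    ["free", "kick", "taken", "by", "the", "striker"],
]

# sg=0 selects CBOW, sg=1 selects Skip-Gram; vector_size is the embedding dimension.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

print(model.wv["striker"][:5])                    # first values of one embedding
print(model.wv.most_similar("striker", topn=3))   # nearest words in the learned space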


2.4 Machine Learning Models

In this section, an explanation and comparison of different machine learning models is presented. First, supervised models are presented; after that, the most common unsupervised models for text classification are explained.

2.4.1 Supervised Models

In supervised learning, models learn from labelled data. After learning from the data, the algorithm predicts which labels should be assigned to unseen data based on the patterns learned before. These models are divided into two categories, classification and regression. For this Dissertation, only classification models are taken into consideration [24].

Decision Trees

A decision tree is a tree-like separation of the data space, in which the partition is achieved with a series of split conditions on the attributes. The main idea is to separate the data space, during the training step, into attribute regions that are heavily biased towards a particular class. During the testing phase, the relevant partition of the data space is selected for the test instance, and the label of the partition is returned. Each node in the decision tree corresponds to an area of the data space defined by the split conditions at its ancestor nodes, and the root node corresponds to the entire data space. However, because of the high-dimensional and sparse nature of text, decision trees do not work well for text classification without further tuning. A multiclass classification problem is decomposed into binary classification problems using the one-against-all approach to overcome the problem of dimensionality. The results obtained from the different classifiers, one per class, are integrated by reporting the most confident prediction [25].

Random Forest

The Random Forest model is an ensemble algorithm based on bootstrap aggregation. Decision trees are used for each classifier, and a random selection is used to determine the split at each node. The approach randomizes the tree construction phase by restricting the splits at the higher levels of the tree to the best feature selected out of a restricted subset of features. Mainly, r features are randomly chosen at each node, and the best splitting feature is selected only out of these features. Moreover, different nodes use different subsets of randomly selected features. Smaller values of r result in higher randomization of the tree construction. At first glance, using such a randomized tree construction should impact the prediction negatively. Nonetheless, the key is that multiple randomized trees are grown, and the predictions for each test point over the different trees are averaged to yield the final result. The class that receives the most votes is predicted for the test instance. The averaging process improves the quality of the predictions over a single tree by adequately using different terms at higher levels of the different trees in the various ensemble components. The final prediction is more robust than that of a single decision tree and may help to prevent the model from overfitting, something that is usual in decision trees [25].

Logistic Regression

Logistic regression is a probabilistic model of the kind referred to as discriminative models. These models assume that the dependent variable is an observed value generated from a probability distribution defined by a function of the feature variables. It is commonly used for binary problems, but it is possible to use the model for multiple classes. The probability of a binary event is estimated by a sigmoid function, which can take any real value as input and convert it to a value between 0 and 1 [26]; an example of a sigmoid function is presented in Eq. (2.4). Logistic regression uses an estimator called maximum likelihood to calculate the model coefficients that relate the predictors to the target.

$p = \frac{1}{1 + e^{-(b_0 + b_1 x)}}$ (2.4)

A technique used in logistic regression to prevent overfitting is to introduce a penalty factor into the loss function. The loss function is a sum of the squared differences between the predicted value and the real value. The most common regularization technique is called L2 regularization or Ridge regularization. The L2 regularization formula defines the regularization term as the sum of the squares of all the feature weights, Eq. (2.5), and forces the weights to be small but never equal to 0 [27].

$L(x, y) = \sum_{i=1}^{n} (y_i - h_\theta(x_i))^2 + \lambda \sum_{i=1}^{n} \theta_i^2$ (2.5)

Support Vector Machine

The support vector machine is used in classification problems. The idea is to plot the data into an n-dimensional space, where n is the number of features in the problem, with the value of each feature being the value of a particular coordinate. The model calculates a maximum-margin boundary, also called a hyperplane, that leads to a homogeneous partition of all data points. An example of a linear separation is presented in Fig. 2.4 [28].

In a perfect scenario, the classes can be separated with no overlapping. However, non-linearly separable classes appear frequently. To overcome this problem, different kernels⁴ are used apart from the linear kernel, such as the polynomial, radial basis function (RBF), sigmoid and precomputed kernels. The most popular kernels are the polynomial and the radial basis function. The formulas of the different kernels are presented in Eqs. (2.6) to (2.9).

Figure 2.4: Linear separation example using Support Vector Machine model [28].

$K(x_i, x_j) = (x_i \cdot x_j + 1)^p$; Polynomial Kernel. (2.6)

$K(x_i, x_j) = e^{-\gamma (x_i - x_j)^2}$; RBF Kernel. (2.7)

$K(x_i, x_j) = e^{-\frac{1}{2\sigma^2}(x_i - x_j)^2}$; Gaussian Kernel (special case of RBF). (2.8)

$K(x_i, x_j) = \tanh(\eta\, x_i \cdot x_j + \nu)$; Sigmoid Kernel. (2.9)

The idea of using non-linear kernels is to apply a transformation to the data points so that it is possible to find non-linear boundaries. The polynomial kernel generates new features by applying a polynomial combination of all the existing features; an example is presented in Fig. 2.5 [28].

Figure 2.5: Polynomial Kernel transformation example to data points [28].

The radial basis function converts the features by measuring the distance between all points and a specific centre. The gamma parameter γ in Eq. (2.7) controls the importance of the new features on the decision frontier: for higher values of gamma, the features have more impact on the frontier decision. An example of the transformation is shown in Fig. 2.6 [28].

Figure 2.6: RBF Kernel transformation example to data points [28].
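The effect of the different kernels can be sketched with scikit-learn on a toy non-linearly separable dataset (two concentric rings); the dataset and hyperparameters below are assumptions for illustration only.

from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two concentric rings: not linearly separable, so a non-linear kernel is needed.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    # gamma controls the reach of the RBF transformation described above.
    clf = SVC(kernel=kernel, degree=3, gamma="scale", C=1.0)
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{kernel:>6} kernel: mean CV accuracy = {score:.2f}")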

Multinomial Naïve Bayes

Multinomial Naïve Bayes is a probabilistic classifier based on Bayes' theorem and a particular application of the Naïve Bayes model. Multinomial Naïve Bayes is the most common model used for text classification and, for that reason, is explained in detail in this section. The probability of a document d belonging to class c is computed as shown in Eq. (2.10) [20], where wi represents each word in document d. The decision rule used by the model is called maximum a posteriori (MAP).

$c_{NB} = \operatorname{argmax}_{c \in C} \left[ \log P(c) + \sum_{i \in positions} \log P(w_i \mid c) \right]$ (2.10)

To avoid underflow and increase the speed of the calculations, they are performed in log space. Multinomial Naïve Bayes calculates the conditional probability of a particular word/term in class c by taking into consideration the frequency of the term wi in documents that belong to class c, Eq. (2.11).

$\hat{P}(w_i \mid c) = \frac{count(w_i, c)}{\sum_{w \in V} count(w, c)}$ (2.11)

The vocabulary V consists of the combination of all the word types in all classes, not only the words in one class c. To eliminate zeros in Eq. (2.11), Laplace smoothing is applied, which simply adds one to each count, leading to Eq. (2.12).

$\hat{P}(w_i \mid c) = \frac{count(w_i, c) + 1}{\sum_{w \in V} \left(count(w, c) + 1\right)}$ (2.12)

Algorithm 1 Multinomial Naïve Bayes Algorithm

procedure TrainMultinomialNB(C, D)
    V ← ExtractVocabulary(D)
    N ← CountDocs(D)
    for c ∈ C do
        Nc ← CountDocsInClass(D, c)
        prior[c] ← Nc / N
        textc ← ConcatenateTextOfAllDocsInClass(D, c)
        for w ∈ V do
            condprob[w][c] ← count(w, c) / ∑w′∈V count(w′, c)
    return V, prior, condprob

procedure ApplyMultinomialNB(C, V, prior, condprob, d)
    W ← ExtractTokensFromDoc(V, d)
    for c ∈ C do
        score[c] ← log(prior[c])
        for w ∈ W do
            score[c] += log(condprob[w][c])
    return argmax c∈C score[c]
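In practice, this classifier is available off the shelf; the sketch below trains scikit-learn's MultinomialNB on toy labelled sentences (illustrative assumptions only), where alpha=1.0 corresponds to the Laplace smoothing of Eq. (2.12).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled commentary sentences standing in for the real training set.
train_sentences = [
    "goal scored from the penalty spot",
    "the referee shows a yellow card",
    "corner kick swung into the box",
    "the players shake hands before the match",
]
train_labels = ["Goal", "Yellow Card", "Corner", "Irrelevant"]

# CountVectorizer builds the word counts; alpha=1.0 applies add-one smoothing.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(train_sentences, train_labels)
print(model.predict(["what a goal that was"]))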

2.4.2 Unsupervised Models

In the area of event detection using unsupervised machine learning algorithms, different techniques are explored, such as:

• Keyword-Based Search [6,29,30]

• Latent Semantic Analysis and Probabilistic LSA [31]

• Latent Dirichlet Allocation [32]

• Hybrid Approaches [33, 34]

• Hierarchical System Search [35]

Keyword-Based Search

A keyword-based search is the most basic model used to classify a document. The idea is to search for a group of keywords that represent the category of the documents; if any of these words is present, the document can be associated with the category.

In the field of sports event detection, many authors have explored this technique; however, three papers present a clear and straightforward treatment of the topic [6, 29, 30]. All of them study event detection using a keyword-based search, focusing on the analysis of well-structured text obtained from webcasting. Webcast text contains content focused on sports events with a well-defined structure and is available from different websites [30]. An example of selected keywords is presented in Table 2.3. With this approach, all authors achieve a high level of precision when detecting events, above 80% for all events; the main reason for these results is the quality of the text and the presence of the keywords when referring to events.

The main drawback of this model is that, for different sports, it is necessary to predefine new keywords; even for the same game, a different presentation style or language will affect the results.

Table 2.3: Keywords example [30].

Event: Keywords
Goal: goal, scored
Shot: shot, header
Save: save, blocked
Offside: offside
Corner: corner kick
Red Card: dismissed, sent off
Yellow Card: booked, booking
Foul: foul
Free kick: free kick, free-kick
Substitution: substitution, replaced
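A tiny sketch of this keyword lookup applied to transcript sentences is shown below; the keyword lists are adapted from Table 2.3, while the matching logic and function name are illustrative assumptions.

# Keyword lists adapted from Table 2.3.
EVENT_KEYWORDS = {
    "Goal": ["goal", "scored"],
    "Shot": ["shot", "header"],
    "Save": ["save", "blocked"],
    "Yellow Card": ["booked", "booking"],
    "Red Card": ["dismissed", "sent off"],
    "Corner": ["corner kick"],
    "Free kick": ["free kick", "free-kick"],
    "Substitution": ["substitution", "replaced"],
}

def detect_events(sentence):
    """Return every event whose keywords appear in the lowercased sentence."""
    text = sentence.lower()
    return [event for event, keywords in EVENT_KEYWORDS.items()
            if any(keyword in text for keyword in keywords)]

print(detect_events("He is booked for that challenge and the free kick follows"))
# ['Yellow Card', 'Free kick']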

Latent Semantic Analysis and Probabilistic LSA

Latent Semantic Analysis (LSA) is an unsupervised approach used in natural language processing. LSA constructs a term-document matrix to describe term occurrences in different documents. The matrix is decomposed using singular value decomposition (SVD). The cosine distance between vectors in the matrix is computed to measure document similarity. However, LSA arises from linear algebra and is based on an L2-optimal approximation, which corresponds to an implicit additive Gaussian noise assumption. Another problem of LSA is its incapability of handling polysemy⁵. This problem is caused by the way LSA writes the words in the latent space: the words are described as a linear superposition of the coordinates of the documents that contain the word, and this superposition principle is unable to capture multiple meanings of a word [31].

The first step in LSA is to generate the document-term matrix. Given m documents and n words in the vocabulary, it is possible to construct an m × n matrix in which each row represents a document and each column represents a word. LSA replaces raw counts with the TF-IDF score and assigns a weight to term j in document i as described in Eq. (2.3) [36].

As discussed before, the LSA matrix needs to be decomposed using SVD. In SVD, a rectangular matrix is decomposed into the product of three matrices. One matrix describes the original row entities as vectors of derived orthogonal factor values, another describes the original column entities, and the last one is a diagonal matrix containing scaling values such that, when the three matrices are matrix-multiplied, the original matrix is restored [37]. In Eq. (2.13), S is a diagonal matrix containing the singular values of M, the rows of U represent document vectors expressed in terms of topics, and the rows of V represent term vectors expressed in terms of topics [36].

$M \approx U_t S_t V_t^T$ (2.13)


The main advantages of LSA are its simplicity and efficiency, but it lacks an efficient representation and needs a broad set of documents and vocabulary to obtain accurate results [36].

Another model used in NLP is probabilistic latent semantic analysis (pLSA). Compared with LSA, pLSA is based on a mixture decomposition acquired from the latent class model, which results in a solid statistical foundation. pLSA applies statistical techniques for model fitting, model selection and complexity control. Mainly, pLSA defines a suitable generative data model and associates a latent context variable with each word occurrence, which takes polysemy into consideration [31]. Notably, pLSA uses a model P(D, W) such that, for a document d and word w, P(d, w) corresponds to an entry in the document-term matrix [36].

To extract topic models, some probabilistic assumptions are added to the previous LSA model, such as [36]:

• given a document d, topic z exists in that document with a probability P(z| d)

• given a topic z, word w is extracted from z with probability P(w | z)

Formally, the joint probability of finding a document and a word together is presented in Eq. (2.14) [36].

$P(D, W) = P(D) \sum_{Z} P(Z \mid D) P(W \mid Z)$ (2.14)

In this case, P(D), P(Z|D) and P(W|Z) are the parameters of the model. P(D) represents the probability of a document being present in the corpus and can be extracted directly. P(Z|D) and P(W|Z) are modelled as multinomial distributions and can be trained using the expectation-maximisation (EM) algorithm [36].

In [31], a comparison between these models for sports event detection can be found. After computing the results, it was possible to verify that pLSA gives better results than LSA, extracting more categories and achieving better precision. The similarity matrix depicted in Fig. 2.7 shows that the documents in different clusters have minimal similarity. In contrast, the results produced by LSA do not exhibit high similarity between documents in the same clusters [31].

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is an unsupervised generative probabilistic model for groups of discrete data. In LDA, each document is generated by selecting a distribution over topics and selecting each word in the document from a topic chosen according to the Dirichlet distribution [32]. LDA is a three-level hierarchical Bayesian model, where each item of a collection is modelled as a finite mixture over an underlying set of topics. Each topic is, in turn, modelled as an infinite mixture over an underlying set of topic probabilities. In the context of text modelling, the topic probabilities provide a representation of a document [38].

Figure 2.7: Similarity matrix of webcast text by (a) pLSA and (b) LSA: the higher the intensity, the more similar the two documents [31].

Figure 2.8 represents the structure of the LDA model. First, from a Dirichlet distribution Dir(α), a random sample describing the topic distribution is retrieved; this topic distribution is θ. From θ, a specific topic Z is selected based on the distribution. Next, from another Dirichlet distribution Dir(β), a random sample representing the word distribution of topic Z is extracted; this word distribution is ϕ. Finally, from ϕ, a word W is chosen [36].

Figure 2.8: Latent Dirichlet allocation model [36].

In [32], an LDA model was implemented to detect events in large corpora, and some interesting conclusions can be drawn from this research. The results showed better precision and accuracy for domains where establishing a mapping from topics to class labels is easier than acquiring a labelled collection of documents [32].
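A compact sketch of fitting an LDA topic model with gensim follows; the toy documents and the number of topics are assumptions for illustration, not the corpora used in the cited works.

from gensim import corpora
from gensim.models import LdaModel

# Tokenised toy documents standing in for webcast or transcript text.
documents = [
    ["goal", "striker", "penalty", "scored"],
    ["save", "keeper", "shot", "blocked"],
    ["card", "foul", "referee", "booked"],
    ["goal", "header", "corner", "scored"],
]

# Map tokens to integer ids and build the bag-of-words corpus.
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# num_topics sets the number of latent topics; passes controls the training sweeps.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=20, random_state=0)

for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)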

Hybrid Approaches

In recent years, new hybrid approaches combining multiple models have emerged; in 2016, Lda2Vec was introduced by Christopher Moody [33]. Lda2Vec is a model that learns dense word vectors in combination with Dirichlet-distributed latent document-level mixtures of topic vectors. The model produces sparse, interpretable document mixtures through a non-negative simplex constraint. It is easy to incorporate into existing frameworks and allows for unsupervised document representations while simultaneously learning word vectors and the linear relationships between them [33].

Lda2Vec was built using the skip-gram model of Word2Vec to generate word vectors. In this method, a context vector is used to make predictions. To create this context vector, it is necessary to have a word vector and a document vector; by summing both, a context vector can be created [36].

The document vector is a weighted combination of two other components: first, a document weight vector that is a representation of each topic in the document; second, a topic matrix that establishes a relation between each topic and a vector embedding [36]. By summing the document vector and the word vector, a context vector is generated for each word in the document [36]. A representation of the Lda2Vec model is shown in Fig. 2.9.

In [34], a comparison between Lda2Vec and single models like LDA and Word2Vec was performed, and the results show that the hybrid model outperformed the single models and not only preserved the statistical relationships between topics and documents but also assimilated the relationships among words into the document vector [34].

The primary advantage of Lda2Vec is its capability of learning from word embeddings while simultaneously learning topic representations and document representations as well [36].


Hierarchical System Search

Another interesting approach to semantic event extraction was developed in [35]. In this paper, an unsupervised model is proposed to extract semantic events from sports webcast text. First, the authors filtered out insignificant words, such as stop words. However, it is relevant to notice that the stop word lists available online would remove important words such as "10-ft", which is vital in basketball to understand the range of a shot, so the authors created a sports stop word list. After filtering, each description is reduced and almost accurately describes an event, for example, "misses shot". The filtered sentences are fed into a hierarchical system search that is responsible for clustering the events [35]. The basic structure proposed is shown in Fig. 2.10a.

With this model, the authors achieve 100% for both precision and recall. However, it was only possible to achieve these results after creating an extensive and well-built sports stop word list to pre-process the well-structured text and extract sentences that almost exactly represent an event.

An example of the created hierarchical system search can be seen in Fig. 2.10b.

(a) Block diagram of the proposed method

(b) Example of the proposed hierarchical search system

Figure 2.10: Structure of the hierarchical system search model proposed by [35].

2.5 Deep Learning

Deep learning is a subfield of machine learning in which artificial neural networks (ANNs), algorithms inspired by the human brain, learn from large amounts of data. In this section, the basics of neural networks as well as examples of deep learning models are explained. Finally, a technique called transfer learning is detailed.

2.5.1 Artificial Neural Networks

The interest in neural networks (NNs) resulted from the ambition to create networks capable of mimicking the human brain and its ability to learn and respond. As a result, NNs have been used in a large number of applications with effective performance in a variety of fields [39].


The human nervous system is a complex neural network. The brain is the central element, consisting of nearly 10¹⁰ neurons that are connected to each other by means of sub-networks. Each neuron is composed of a body, one axon and multiple dendrites. The model of a neuron is represented in Fig. 2.11a. The primary function of a dendrite is to receive signals from other neurons. The axon is responsible for conducting an electric pulse from the neuron body, and it divides into several branches that connect to other neurons. The gap between the axon branches and the dendrites of other neurons is called the synapse. Neurons create multiple synaptic connections with other neurons; the number can vary from a few hundred to 10⁴. The cell body of a neuron sums the signals coming from the dendrites. The impulse received will be sent through its axon if the input signals are sufficient to stimulate the neuron beyond its threshold level [39].

The perceptron, invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt, is a linear model used for binary classification and is considered an artificial neuron that uses the Heaviside step function as its activation function. Fig. 2.11b represents the perceptron neuron and its resemblance to the biological neuron [40].


Figure 2.11: Representation of the biological neuron (a) and the artificial neuron (b) [41].

The neurons used in recent neural networks are similar to the first perceptron model; the main difference is the flexibility in the type of activation function used. The representation is presented in Fig. 2.12. The components of an artificial neuron are explained below.


Weights

The weights, present on the connections of a neural network, are coefficients that scale (increase or decrease) the input signal to a given neuron in the network. In the representation of the neuron in Fig. 2.12, the weights are denoted w [40].

Biases

Biases are scalar values added to the input to ensure that at least some nodes per layer are activated regardless of signal strength. Biases are usually denoted b [40].

Activation Functions

Activation functions define the output that a layer's nodes forward to the next layer, taking the input signals into consideration. Activation functions are scalar-to-scalar functions, producing the neuron's activation. They are used in hidden neurons to insert nonlinearity into the network's modelling capabilities [40]. In Fig. 2.13, a graphical representation of different activation functions is presented.

• Linear

A linear activation function is essentially the identity function, f(x) = Wx, where the dependent variable is directly proportional to the independent variable. This function passes the signal through without changes. A linear transform is commonly used in the neural network's input layer.

• Sigmoid

A sigmoid function maps independent variables of near-infinite range into probabilities between 0 and 1 without discarding outliers from the data. For inputs of large magnitude, the outputs will be very close to 0 or 1.

• Tanh

Tanh is a hyperbolic trigonometric function, tanh(x) = sinh(x)/cosh(x), and, unlike the sigmoid function, its normalized range is between -1 and 1. The advantage of the tanh function is that it deals better with negative numbers than the sigmoid function. A variant of the tanh function is the hard tanh; the only difference is that any value higher than 1 is converted to 1 and, likewise, any value lower than -1 is converted to -1.

• Rectified Linear Unit

The Rectified Linear Unit (ReLU) activation function activates a node only if the input is above a certain quantity. If the input is below 0, the output is 0; if the input is above the threshold, the output is linearly related to the input, following f(x) = max(0, x).

• Softplus

The last activation function is a smooth version of the ReLU function, with smoothing and nonzero-gradient properties that improve the stabilization and performance of deep neural networks designed with softplus units. However, ReLU and its derivative are much easier and more efficient to compute than the softplus function. The activation function is f(x) = ln(1 + e^x). A comparison between the ReLU and softplus functions can be seen in Fig. 2.13d, where it is possible to verify that the softplus function is differentiable, with a nonzero derivative everywhere on the graph, which is not the case for ReLU.

(a) Linear activation function. (b) Sigmoid activation function.

(c) Tanh activation function. (d) ReLU and softplus activation functions.

Figure 2.13: Graphical representation of different activation functions [40].
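To make the shapes of these functions concrete, the following is a small NumPy sketch of the activation functions discussed above; it is only an illustrative implementation, not code taken from [40].

```python
# Illustrative NumPy implementations of the activation functions above.
import numpy as np


def linear(x):
    return x  # identity: passes the signal through unchanged


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # squashes inputs into (0, 1)


def tanh(x):
    return np.tanh(x)  # squashes inputs into (-1, 1)


def relu(x):
    return np.maximum(0.0, x)  # zero below 0, identity above it


def softplus(x):
    return np.log1p(np.exp(x))  # smooth, everywhere-differentiable version of ReLU


x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (linear, sigmoid, tanh, relu, softplus):
    print(f.__name__, f(x))
```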

2.5.2 Loss Function

Loss functions are used to quantify how far a neural network is from the ideal toward which it is training. A metric is calculated based on the observed errors; these errors are then aggregated and averaged to obtain a single value that represents how close the neural network is to its ideal. The functions most used for classification problems are the hinge loss, Eq. (2.15), the logistic loss, Eq. (2.16), and the negative log likelihood, Eq. (2.17).


\[
L(W, b) = \frac{1}{N} \sum_{i=1}^{N} \max\left(0,\ 1 - y_{ij} \times \hat{y}_{ij}\right) \tag{2.15}
\]

\[
L(W, b) = \prod_{i=1}^{N} \hat{y}_i^{\,y_i} \times \left(1 - \hat{y}_i\right)^{1 - y_i} \tag{2.16}
\]

\[
L(W, b) = - \sum_{i=1}^{N} \sum_{j=1}^{M} y_{i,j} \times \log\left(\hat{y}_{i,j}\right) \tag{2.17}
\]
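A minimal NumPy sketch of how two of these losses aggregate per-sample errors is given below; it follows the general form of Eq. (2.15) and Eq. (2.17) and is only an illustration with made-up toy values.

```python
# Illustrative sketch of the hinge loss (Eq. 2.15) and the negative
# log likelihood (Eq. 2.17) for a small batch; not code from [40].
import numpy as np


def hinge_loss(y_true, y_scores):
    # y_true in {-1, +1}, y_scores are raw predictions; averaged over the N samples
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_scores))


def negative_log_likelihood(y_true, y_prob):
    # y_true is one-hot over M classes, y_prob are predicted class probabilities
    eps = 1e-12  # avoid log(0)
    return -np.sum(y_true * np.log(y_prob + eps))


y_true_margin = np.array([1.0, -1.0, 1.0])
y_scores = np.array([0.8, -0.3, -0.2])
print(hinge_loss(y_true_margin, y_scores))

y_true_onehot = np.array([[1, 0], [0, 1]])
y_prob = np.array([[0.9, 0.1], [0.2, 0.8]])
print(negative_log_likelihood(y_true_onehot, y_prob))
```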

2.5.3 Deep Networks Architectures

The most common deep network architectures are the multilayer feed-forward network, the convolutional neural network (CNN) and the recurrent neural network (RNN); they will be explained in this subsection. In addition, an RNN variant called Long Short-Term Memory (LSTM) will be described.

Multilayer Feed-Forward Network

A multilayer feed-forward network is a fully-connected neural network with an input layer, one or more hidden layers, and an output layer. The input layer is composed of multiple neurons that receive information from external sources and pass it into the network for processing. The hidden layers receive information from the input layer and process it internally, with no direct contact with the input or output of the system. Hidden layers are the key to allowing neural networks to model non-linear functions. Finally, the output layer receives the processed information and sends signals out of the system. A multilayer feed-forward network is represented in Fig. 2.14 [40].

Figure 2.14: Representation of a fully connected multilayer feed-forward neural network topology [40].
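As a concrete example of this topology, the sketch below builds a small fully connected feed-forward network in Keras; the framework choice, layer sizes and binary output are assumptions made only for illustration.

```python
# Minimal Keras sketch of a fully connected multilayer feed-forward network:
# one input layer, two hidden layers and a binary output. Dimensions are arbitrary.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input

model = Sequential([
    Input(shape=(100,)),                 # input layer: 100 features per sample
    Dense(64, activation="relu"),        # hidden layer 1
    Dense(32, activation="relu"),        # hidden layer 2
    Dense(1, activation="sigmoid"),      # output layer (binary classification)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```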


Convolutional Neural Networks

Convolutional neural networks use layers with convolving filters that are applied to local features [42]. This type of model was first used for computer vision, but in recent years it has also been applied to NLP tasks. CNNs belong to the class of deep, feed-forward neural networks, where the nodes do not form cycles between them. Typically, a CNN is composed of three types of layers: convolutional layers, pooling layers and fully connected layers.

• Convolutional Layer

Convolutional layers are considered the core block of a CNN architecture. They transform the input matrix by using a patch of locally connected neurons from the previous layer: a dot product is computed between the neurons in the input layer and the weights to which they are locally linked in the output layer. The essential concept in these layers is the convolution [40].

A convolution is mathematically defined as an operation that describes a rule for merging two sets of information. The convolution takes raw data values as input and outputs a feature map. The operation, shown in Fig. 2.15, is often regarded as a filter in which the kernel filters the input data for specific information.

Figure 2.15: Convolutional operation [40].

In Fig. 2.15, it is possible to verify that the kernel is slid across the input data to produce the convolved output. At each stride, the kernel is multiplied with the corresponding input region, forming a single entry in the output feature map. The weights in a convolutional layer are named the filter or kernel. This filter is convolved with the input data, and the result is a feature map, also referred to as an activation map [40]. A small sketch of this operation, together with max pooling, is given after this list.

• Pooling

Pooling layers are inserted after the convolutional layers. They are used to progressively reduce the dimensions of the data representation. By gradually shrinking the data representation over the network, they help control overfitting and reduce computation costs. The layer typically uses the max operation to resize the data spatially.



The operation is denoted max pooling. For example, in Fig. 2.16, the max value is extracted from each feature map. The operation does not alter the depth dimension [40].

• Fully Connected Layers

The fully connected layers are used to compute each class score. The dimensions of the output data are [1x1xN], where N represents the number of output classes. For a binary classification problem, N would be equal to 2. Fully connected layers execute transformations on the input data [40].
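The sketch below, referred to in the convolutional layer description above, illustrates the convolution and max-pooling operations in plain NumPy; kernel values and sizes are arbitrary and chosen only for demonstration.

```python
# Illustrative NumPy sketch of convolution and max pooling.
import numpy as np


def convolve2d(x, kernel):
    """Valid 2-D convolution (no padding, stride 1): slide the kernel and take dot products."""
    kh, kw = kernel.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # one entry of the feature map per kernel position
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out


def max_pool2d(x, size=2):
    """Non-overlapping max pooling that shrinks each spatial dimension by `size`."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))


feature_map = convolve2d(np.arange(36, dtype=float).reshape(6, 6), np.ones((3, 3)))
print(max_pool2d(feature_map))  # reduced spatial representation
```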

In [43], multiple neural network models, including a CNN, were created for short-text classification. The CNN model was tested on different datasets and outperformed traditional models, including Naïve Bayes and SVM. Also, in [44], a CNN was tested alongside other deep learning models and traditional models. The conclusions drawn by the authors were similar: the CNN model outperformed the traditional models. An illustration of a CNN architecture for sentence classification is presented in Fig. 2.16 [45].


Recurrent Neural Networks

The recurrent neural network model differs from feed-forward networks because it is possible to send information across time-steps, which means that loops are created. Recurrent neural networks are closer to the human brain, which has a large feedback network of connected neurons that can learn to translate a lifelong sensory input stream into useful motor outputs. RNNs operate on sequences of input and output vectors. The following list gives examples of the different operation sequences [40]:

• One-to-many: sequence output. For example, image captioning takes an image and outputs a sequence of words.

• Many-to-one: sequence input. For example, sentiment analysis, where a sentence is given as input.

• Many-to-many: sequence input and sequence output. For example, video classification, where each frame is labelled.

This type of network has issues with the vanishing and exploding gradient problems. These problems happen when the gradients become too small or too large, respectively, and make it hard to model long-range dependencies in the structure of the input dataset. The most effective way to overcome this issue is to use the Long Short-Term Memory variant of RNNs. LSTM networks were introduced in 1997. The critical components of the LSTM are the memory cell and the gates. The contents of the memory cell are adjusted by the input gates and forget gates. If both gates are closed, the contents of the memory cell remain unchanged between one step and the next. The gating structure permits gradients to travel across many time-steps and thereby overcomes the vanishing gradient problem present in most RNN models. The central body of the LSTM unit is referred to as the LSTM block. An example of an LSTM block is presented in Fig. 2.17 [40].



Regarding previous work, in [46] a bidirectional LSTM model was developed, and the authors achieved better results when compared to RNN and CNN models. In [43, 44], apart from comparing a CNN model, as discussed before, the authors also compared an RNN model, and the results were better than the ones obtained with a CNN.
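As a concrete example of the many-to-one setting described above, the sketch below builds a small bidirectional LSTM classifier in Keras; it is a generic illustration, not the model of [46], and the vocabulary size, sequence length and dimensions are arbitrary assumptions.

```python
# Minimal Keras sketch of a many-to-one (sequence in, single label out) classifier:
# a padded sequence of token ids goes in, one class probability comes out.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Bidirectional, Dense, Embedding, Input, LSTM

model = Sequential([
    Input(shape=(50,)),                           # padded sequence of 50 token ids
    Embedding(input_dim=10000, output_dim=128),   # token ids -> dense vectors
    Bidirectional(LSTM(64)),                      # reads the sequence in both directions
    Dense(1, activation="sigmoid"),               # binary decision, e.g. event vs. no event
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```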

Transfer Learning

The goal of transfer learning is to improve learning on the target task by leveraging knowledge from a source task. There are three standard measures by which transfer learning may improve learning. The first is the initial performance achievable in the target task using only the transferred knowledge, compared with the initial performance of an ignorant agent. The second is the amount of time needed to fully learn the target task given the transferred knowledge, compared with the time required to learn it from scratch. The third is the final performance level achievable in the target task, compared with the final level without transfer. Fig. 2.18 illustrates these three measures [47]. Transfer learning is usually considered a viable solution when [40]:

• Training dataset is small.

• Training dataset shares visual features with the base dataset.

Figure 2.18: Three ways in which transfer might improve learning [47].

The methodologies used for applying transfer learning to deep learning models are:

• Feature-extraction

As discussed before, deep learning models are layered architectures that learn different features at different layers. The final layer is responsible for outputting the final value. The concept of a layered architecture allows the use of a pre-trained model as a fixed feature extractor for other tasks by removing its final layer. Transfer learning based on feature-extraction is represented in Fig. 2.19. For example, the AlexNet model can be used without its final layer to transform images from a new domain into 4096-dimensional vectors based on its hidden layers, enabling features learned on the source task to be extracted [48]. A minimal sketch of this strategy, together with fine-tuning, is given after this list.


Figure 2.19: Transfer learning representation based on feature-extraction [48].

• Fine-Tuning

Fine-tuning is used to replace the final layer for the intended task and also to selectively retrain some of the previous layers. The initial layers of a deep neural network are responsible for capturing generic features, while the last layers focus on the specific task. In fine-tuning methods, the weights of some layers are kept fixed while other layers are retrained. To summarize, fine-tuning reuses the knowledge embedded in the overall architecture of the network and uses its pre-trained weights as the starting point for the retraining phase.
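The sketch below, referenced in the feature-extraction description, illustrates both strategies in Keras using a pre-trained VGG16 as the base model; VGG16 stands in here only because AlexNet is not shipped with keras.applications, and the layer choices are illustrative assumptions.

```python
# Minimal Keras sketch of feature-extraction and fine-tuning with a
# pre-trained base model; the new head and frozen/unfrozen layer split
# are arbitrary choices for illustration.
from tensorflow.keras import Sequential
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense

base = VGG16(weights="imagenet", include_top=False, pooling="avg",
             input_shape=(224, 224, 3))

# Strategy 1 - feature-extraction: freeze every pre-trained layer and
# train only the new task-specific final layer.
base.trainable = False
model = Sequential([base, Dense(2, activation="softmax")])  # new output layer

# Strategy 2 - fine-tuning: additionally unfreeze the last few layers so
# their weights are retrained, while the early, generic layers stay fixed.
base.trainable = True
for layer in base.layers[:-4]:
    layer.trainable = False

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```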

2.6 Discussion

In this chapter, the state of the art related to this Dissertation was presented. The main conclusions on these topics are summarized below.

• With the growth of technologies and available data, speech-to-text API providers achieve a high WAcc, which enables the development of this project. However, it is important to understand that if an error occurs in the conversion process, the text mining model might be affected by it.

• Text mining techniques present several challenging issues. To address them, some solutions have been proposed in the field of sports event detection. However, not all approaches are suitable for this Dissertation, because the majority of the work done in this area focuses on analysing webcast text, a well-structured text that reduces the problem complexity.

• Deep learning models are more complex and offer better results when compared to traditional models. However, they require more training data and computational power. Transfer learning techniques have been developed in recent years in order to overcome these problems.

• Several authors explored different approaches to detect events in sports; however, to the best of our knowledge, the analysis of the commentator's speech to create an automatic tool has not been performed.

