
Unraveling compound taxonomies in untargeted metabolomics through artificial intelligence


Academic year: 2023



UNIVERSIDADE DE LISBOA FACULDADE DE CIÊNCIAS

DEPARTAMENTO DE QUÍMICA E BIOQUÍMICA

Unraveling compound taxonomies in untargeted metabolomics through artificial intelligence

Henrique dos Santos Silva

Master's Degree in Biochemistry, Specialization in Biochemistry

Dissertation supervised by:

Professor Doutor António E. N. Ferreira

Professor Doutor Carlos Cordeiro


Resumo

Metabolomics is the identification and quantification of the complete set of metabolites (the metabolome) in a biological sample - whole organisms, tissues, cell cultures, etc.

Metabolites are low molecular weight molecules that occur as intermediates or end products of multiple enzymatic reactions; they are therefore part of cellular metabolism and provide information about the state of the cell. One of the main techniques for acquiring metabolomics data is Mass Spectrometry (MS), which stands out for its high sensitivity towards a wide diversity of chemical compounds, allowing greater coverage of the metabolome. In particular, Fourier-Transform Ion Cyclotron Resonance Mass Spectrometers (FT-ICR-MS) have high mass accuracy and reach ultra-high resolutions, which reduce the need for sample separation and allow the identification of isotopic patterns of low molecular weight compounds such as metabolites, making the unambiguous assignment of their molecular formula possible.

Before MS, it was impossible to obtain the elemental ratios of individual compounds, so this characterization was performed on whole samples (or fractions thereof). In 1950, Dirk Willem van Krevelen proposed that the chemical nature of samples could be inferred from their elemental ratios, which led to what are now known as van Krevelen diagrams (O/C vs H/C), first used to study petroleum and kerogen samples. Since then, this type of representation has been used to characterize organic samples in other applications, such as the characterization of the main compound categories of natural organic matter. In 2003, van Krevelen diagrams were used for the first time to represent MS data in metabolomics and have been widely used for that purpose ever since. Building on this use of van Krevelen diagrams to classify compounds, a new classification method (MSCC) was proposed, based on imposing constraints on 10 features of chemical formulas (O/C, H/C, N/C, P/C, N/P, O, N, P, S, and Mass) to classify compounds into 6 categories: Lipids, Peptides, Amino-sugars, Carbohydrates, Nucleotides, and Phytochemical compounds. Although this method shows a significant performance improvement over those based on classical van Krevelen diagrams, these categories are too unspecific to describe the complexity of an organism's metabolome.

ChemOnt is a taxonomy with a well-defined hierarchy, a dictionary with complete annotations for each of its categories, and a set of classification rules that allow new entities (compounds) to be described as well. This enables automated, rule-based structural classification of all chemical entities. ChemOnt has 11 classification levels, the first 4 being, in order: Kingdom, Superclass, Class, and Subclass. With a well-defined hierarchy, well-described categories, and an automated tool for classifying new compounds, ChemOnt is the ideal taxonomy for fast, large-scale classification tasks.

Artificial intelligence aims to simulate human behaviour in machines to solve complex problems. Machine Learning (ML) is a sub-field of artificial intelligence in which machines learn automatically from data, without being explicitly programmed to do so, in order to predict the outcome for new data. Supervised learning algorithms aim to map a given input to the correct output, which is done by inferring a function from labelled training data. The most common types are classification tasks, which separate the data when the labels represent a discrete variable, and regression tasks, when the labels represent a continuous variable.
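The classification/regression distinction can be illustrated with a minimal scikit-learn example; the data and labels below are toy values, not taken from this work:

```python
# Same inputs, two supervised tasks: a discrete label (classification)
# versus a continuous target (regression). Toy data for illustration only.
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = [[0.0], [1.0], [2.0], [3.0]]
clf = KNeighborsClassifier(n_neighbors=1).fit(X, ["low", "low", "high", "high"])
reg = KNeighborsRegressor(n_neighbors=1).fit(X, [0.1, 0.9, 2.1, 3.2])
print(clf.predict([[2.5]]), reg.predict([[2.5]]))  # discrete label vs real value
```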


The main goal of this work was to create a metabolite classification model more robust than the MSCC, using artificial intelligence methods, which can handle a large number of chemical formula features that may provide more information for classification, as well as a more descriptive hierarchical taxonomy. To this end, a hierarchical strategy with one local classifier per parent node was applied, using popular ML classification algorithms: Random Forests (RF), K-nearest-neighbours (KNN), Logistic Regression (LR), Support Vector Machines (SVM), and Naive Bayes (NB).

The dataset was built with compounds from 4 different databases: the Human Metabolome Database, Kyoto Encyclopedia of Genes and Genomes Compounds, Lipid Maps Structural Database, and Chemical Entities of Biological Interest. The features used were the atomic counts of all chemical elements, the charge, the monoisotopic mass, the total counts of some groups of elements, and the O/C, H/C, N/C, P/C, and N/P ratios. After building the dataset, a stratified random 33/67 train/test split was performed, thus keeping the proportions of each class. Classifier training and tuning used grid search, in which a classifier is trained with different combinations of parameters to determine which one generalizes best to unseen data. This evaluation was done with stratified 3-fold cross-validation, with tuning based on the macro-averaged F1-score, which gives equal weight to each class. Feature selection was based on the mean decrease in Gini impurity (MDI) of the RF, removing correlated features from among those previously selected.
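The tuning procedure described above can be sketched with scikit-learn. The synthetic dataset and the parameter grid below are illustrative placeholders, not the ones actually used in this work; only the stratified 33/67 split, the stratified 3-fold CV, and the macro F1 scoring follow the text:

```python
# Grid search over RandomForest hyperparameters, scored with stratified
# 3-fold cross-validation and macro-averaged F1 (equal weight per class).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

X, y = make_classification(n_samples=600, n_classes=3, n_informative=6, random_state=0)
# Stratified 33/67 train/test split, preserving class proportions
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.67, stratify=y, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [None, 10]},  # illustrative grid
    scoring="f1_macro",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
grid.fit(X_tr, y_tr)
print(grid.best_params_, round(grid.best_score_, 3))
```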

The MDI revealed that, out of a total of 133 features, only 25 have an importance of at least 0.1 in at least one of the classifiers. All the classifiers needed for the hierarchical approach were trained and optimized with grid search using the 5 algorithms, with either all features or only those selected for that classifier. The Superclass-level classifier for organic compounds showed significant overfitting. A pruning algorithm (cost complexity pruning) was tested but proved ineffective at reducing the overfitting. Additionally, two binary multiclass strategies were tested with the RF to train this classifier: output-code and one-vs-rest. The first was applied directly with the scikit-learn implementation. The second was implemented with one binary classifier per class, additionally using two strategies of random sampling of the negative set to counter the marked class imbalance in the dataset. These approaches also proved ineffective at increasing the performance of the classifier.
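The MDI-based selection can be sketched as follows; the synthetic data and feature counts are illustrative, with only the 0.1 importance cut-off taken from the text:

```python
# Feature selection via a random forest's mean decrease in Gini impurity
# (exposed by scikit-learn as feature_importances_, normalized to sum to 1).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=20, n_informative=5, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X, y)
importances = rf.feature_importances_        # MDI per feature
keep = np.flatnonzero(importances >= 0.1)    # same 0.1 cut-off as in the text
print(keep, importances[keep].round(2))
```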

Averaging over all the trained classifiers, the best-performing algorithms were, in decreasing order: RF, KNN, LR, SVM, and NB. For the classification model, the best algorithm/parameter combination of each classifier was chosen; the NB algorithm was excluded because it was never the single best algorithm for any classifier, and SVM because it does not return prediction probability estimates.

Computing the weighted average of the macro and micro F1-scores on the validation set of the classifiers at each classification level, the local performance of the classifiers does not decrease across levels, remaining between 87-89% accuracy for the 3 classification levels beyond the first, which naturally performs better. The fact that the macro F1-score is better at the last level also indicates that, even with more specific categories, the hierarchical approach can distinguish them, and that the chemical composition carries enough information to do so. Regarding the performance of the hierarchical approach used, it proved advantageous for distinguishing between categories of the same level, particularly for classifying smaller classes, and for reducing the computational resources needed to train a single classifier per level.

To evaluate the performance of the classification model, two types of validation were performed: with the test set of the initial dataset, and with FT-ICR-MS metabolomics data. For prediction, a top-down approach was used, together with a blocking strategy in which the multiplicative prediction probability at each level is subject to different thresholds.
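A minimal sketch of this blocking strategy follows; the function name and the example path with its probabilities are hypothetical, chosen only to illustrate the mechanism:

```python
# Top-down "blocking": the product of prediction probabilities along the
# hierarchy path is compared against a threshold; prediction stops at the
# level where the cumulative probability falls below it.
def predict_top_down(level_probs, threshold=0.95):
    """level_probs: list of (label, probability) per level, root first."""
    path, cumulative = [], 1.0
    for label, p in level_probs:
        cumulative *= p
        if cumulative < threshold:
            break          # block: deeper levels are not predicted
        path.append(label)
    return path

print(predict_top_down([("Organic compounds", 0.999),
                        ("Benzenoids", 0.98),
                        ("Benzene and substituted derivatives", 0.90)]))
# 0.999*0.98 = 0.979 >= 0.95 keeps two levels; *0.90 = 0.881 blocks the third
```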

Using this strategy, most compounds keep all 4 classification levels with a probability of at least 0.95. The micro F1-score (accuracy), as expected, decreases along the hierarchy levels. The Kingdom level has nearly perfect classification (accuracy = 99.98%); at the Superclass level the accuracy is 88.4%, with 3 categories (out of 26) never predicted; at the Class level the accuracy is 79.7%, with 74 categories (out of 311) not predicted; and at the Subclass level the accuracy is 74.6%, with 192 categories (out of 724) not predicted. Comparing the results of the blocking strategy, it is also clear that attention must be paid to the trade-off between accuracy and the coverage of compounds with a prediction. For experimental validation, FT-ICR-MS data from yeast samples and from human fingerprints were used. Only "Organic compounds" were predicted in both datasets, with 100% accuracy; at the Superclass level the accuracy is >92%, at the Class level >87%, and at the Subclass level >78%.

Keywords: Metabolomics; FT-ICR-MS; Metabolite classification; Machine Learning


Abstract

FT-ICR-MS instruments have ultra-high resolution and extreme mass accuracy, which allows the unambiguous attribution of chemical formulas to metabolites. This work aimed to develop a tool for FT-ICR-MS-based metabolomics that uses the annotated chemical formulas to classify metabolites into a more descriptive taxonomy than those of previously developed classification systems.

The ChemOnt taxonomy was used, which is hierarchical, with four main classification levels: Kingdom, Superclass, Class, and Subclass. AI approaches (ML classification algorithms) were used to build the classification model, following a local classifier per parent node hierarchical approach. Five proven algorithms were used to train and tune each classifier: RF, KNN, LR, SVM, and NB. Tuning was performed with 3-fold cross-validation using grid search based on the macro-averaged F1-score. Feature selection was performed using the MDI of RF, and one feature of each pairwise-correlated pair was removed. The MDI revealed that, out of a total of 133 features, only 25 had an importance of at least 0.1 in at least one of the classifiers. The "Organic compounds" classifier presented high overfitting. Cost-complexity pruning was applied; however, performance did not increase and overfitting did not decrease. Two multiclass approaches were used with this classifier: "output-code" and "one vs rest" with sampling of the negatives. Neither was shown to increase the performance of the classifier either. The performance of the algorithms was, in decreasing order: RF, KNN, LR, SVM, and NB. The last two algorithms were left out of the final classification model. Validation accuracy on the test set at each level was 99.98% (Kingdom), 88.4% (Superclass), 79.7% (Class), and 74.6% (Subclass).

Experimental validation with FT-ICR-MS data (yeast and human fingerprint) showed that only "Organic compounds" were predicted, with 100% accuracy, and at the remaining levels: Superclass (>92%), Class (>87%), and Subclass (>78%).

Keywords: Metabolomics; FT-ICR-MS; Metabolite classification; Machine Learning


Acknowledgements

This master’s dissertation was a remarkable and tremendous learning experience, and I would like to thank several people. First of all, I want to thank Professor Doutor António Ferreira for proposing this project and, along with Professor Doutor Carlos Cordeiro, for their support and supervision throughout this year. I also want to thank Professor Doutor Marta Sousa Silva for always being present throughout this year.

Also, a big thank you to all the people from the group who were frequently in the room and ready to talk about our research struggles and other random stuff. Thank you for always encouraging me when things were not going that well, for your companionship, for all the lunches and other not-yet fulfilled gatherings. Also, a kind note of appreciation to Francisco Traquete with whom I more frequently discussed insights and thoughts about my project.

To all my friends and family, thank you for always believing in me and for all your continuous support and encouragement through these years. A special thanks to my girlfriend, Raquel, who has been the greatest support, especially along this year. Thank you for all the times you had to listen to me explain things that you couldn’t quite understand just so that I could share my struggles, for always believing that I would be able to do things the right way, and for always pushing me forward.


Abbreviations

AI - Artificial Intelligence

API - Application Programming Interface

CART - Classification and Regression Trees

ChEBI - Chemical Entities of Biological Interest

DAG - Directed Acyclic Graph

DT - Decision Tree

FN - False Negative

FP - False Positive

FT-ICR-MS - Fourier Transform Ion Cyclotron Resonance Mass Spectrometry

HMDB - Human Metabolome Database

ID3 - Iterative Dichotomiser 3

InChI - IUPAC International Chemical Identifier

KEGG - Kyoto Encyclopedia of Genes and Genomes

KNN - K-nearest-neighbours

LCL - Local classifier per level

LCN - Local classifier per node

LCPN - Local classifier per parent node

LMSD - Lipid Maps Structure Database

LR - Logistic Regression

MDI - Mean Decrease in Gini Impurity

ML - Machine Learning

MS - Mass Spectrometry

MSCC - Multidimensional Stoichiometric Compound Classification method

N - Negatives

NB - Naïve Bayes

NMR - Nuclear Magnetic Resonance

P - Positives

RF - Random Forest

RL - Reinforcement Learning

SVM - Support Vector Machines

TN - True Negative

TP - True Positive


Index

1. Introduction

1.1. Metabolomics

1.1.1. Mass spectrometry in metabolomics

1.2. Compound classification

1.2.1. ChemOnt, a chemical taxonomy

1.3. Machine learning

1.4. Types of machine learning algorithms

1.4.1. Supervised learning

1.4.2. Unsupervised learning

1.4.3. Semi-supervised learning

1.4.4. Reinforcement learning

1.5. Machine learning classification task

1.5.1. Decision Tree

1.5.2. Random Forest

1.5.3. Boosting

1.5.4. Naïve Bayes

1.5.5. K-nearest-neighbours

1.5.6. Logistic Regression

1.5.7. Support Vector Machines

1.5.8. Dimensionality reduction and Feature learning

1.5.9. Classifier Evaluation, Validation, and Selection

1.5.10. Hierarchical classification approaches

1.6. Aims

2. Methods

2.1. Dataset construction

2.1.1. HMDB

2.1.2. LMSD

2.1.3. KEGG Compounds

2.1.4. ChEBI

2.1.5. Joining databases

2.1.6. Classification – ChemOnt

2.2. Data processing

2.2.1. Classification imputation & filtering

2.2.2. Feature engineering

2.2.4. Scaling

2.3. Machine learning hierarchical approach

2.4. Feature selection

2.5. Classifiers’ training and tuning

2.6. Binary multiclass approaches of RF

2.7. Probability of prediction estimates

2.8. Experimental data

3. Results

3.1. Feature selection

3.2. Training & Tuning

3.2.1. Random Forest – number of trees

3.2.2. Kingdom level

3.2.3. Superclass level

3.2.4. Class level

3.2.5. Subclass level

3.3. Classifier selection

3.4. Local classifier per level hierarchical approach

3.5. Validation of the classification model

3.5.1. Test set validation

3.5.2. Experimental validation

4. Conclusion

5. References

6. Supplementary Material


Figure Index

Figure 1.1. Mass spectrum from an FT-ICR-MS instrument coupled with electrospray ionization and illustration of the resolution of this instrument.

Figure 1.2. Van Krevelen diagram (H/C vs O/C representation) for the classification of organic compounds. Limits defined in previous studies are represented in ticked lines. [12]

Figure 1.3. Illustration of the ChemOnt taxonomy as a tree. [21]

Figure 1.4. Diagram showing the four main types of Machine Learning algorithms, the type of data used by each of them to build the model, and some examples. Adapted from [26].

Figure 1.5. A) Structure of a Decision Tree: the topmost node is the Root node, Decision (or Internal) nodes are the intermediary ones, where each feature is tested, Leaf nodes hold the outcome label, and Branches represent the outcome of the test made on their node of origin. [26] B) Example of a classification tree, where all the features are categorical variables. Root and Decision nodes have a rectangular shape, and Leaf nodes have an oval shape. [34]

Figure 1.6. Example of Tree Pruning - Left: unpruned Decision Tree; Right: pruned Decision Tree. Adapted from [34].

Figure 1.7. Hierarchical classification approaches. Circles represent classes. Dashed lines delimit a multiclass classifier (flat classifier and local classifier per level), a binary classifier (local classifier per node), or a multiclass classifier predicting child nodes (local classifier per parent node). Adapted from [47].

Figure 2.1. Example of the InChI and InChIKey for the caffeine molecule. [97]

Figure 2.2. Category size distribution in the four main levels of the classification hierarchy.

Figure 2.3. Example of the imputation on the classification hierarchy.

Figure 3.1. Heatmap of feature importance at each parent node, based on the mean decrease in Gini impurity of a random forest trained with default parameters. Only features with at least 0.1 importance in one parent node are shown.

Figure 3.2. Out-of-bag score vs the number of trees in a random forest classifier (Kingdom, Superclass, and Class level classifiers).

Figure 3.3. Effect of post-pruning (cost complexity pruning algorithm) on the six best RF estimators of the “Organic compounds” classifier.

Figure 3.4. Distribution of the number of categories predicted when the blocking strategy is applied with each threshold.

Figure 3.5. Kingdom level - confusion matrices (no blocking and blocking with a 0.95 probability threshold) for the top-down prediction of the test set.

Figure 3.6. Superclass level - confusion matrix for the top-down prediction of the test set (without blocking).

Figure 3.7. Superclass level - confusion matrix for the top-down prediction of the test set (with blocking - 0.95 threshold).

Figures 3.8 & 3.9. Performance of the classification of each Class (Precision and Recall metrics).

Figures 3.10-3.12. Performance of the classification of each Subclass (Precision and Recall metrics).

Figure 3.13. Distribution of Superclass and Class categories in the yeast data.

Figure 3.14. Distribution of Superclass and Class categories in the human fingerprint data.


Table Index

Table 1.1. Evaluation metrics used for machine learning classifiers. Some are known by more than one name. TP, TN, FP, FN, P, and N stand for True Positives, True Negatives, False Positives, False Negatives, Positives, and Negatives. Adapted from [34].

Table 2.1. Parameter space explored with grid search using the RF, KNN, LR, SVM, and NB algorithms.

Table 3.1. Kingdom level - cross-validation results from each grid search's best combination of parameters. The best feature selection/algorithm combination is highlighted.

Table 3.2. Superclass level - cross-validation results from the best combination of parameters of each grid search. The best feature selection/algorithm combination is highlighted.

Table 3.3. Comparison of cross-validation scores from a grid search with a native random forest multiclass approach vs a random forest with a binary multiclass output-code approach.

Table 3.4. Test prediction comparison between a random forest "one vs rest" multiclass approach (with and without sampling of the negatives) and a native random forest multiclass approach.

Table 3.5. Class level - cross-validation results from the best combination of parameters of each grid search. The best algorithm is highlighted.

Table 3.6. Summary of the grid search results at the Class level.

Table 3.7. Summary of the grid search results at the Subclass level.

Table 3.8. The number of times each algorithm has the best performance in a classifier.

Table 3.9. Weighted macro and micro F1-scores of all the classifiers trained at each level.

Table 3.10. Comparison between the Local Classifier per Parent Node (LCPN) approach and a "flat" classifier for each level of classification (Local Classifier per Level hierarchical approach, LCL). The LCPN approach scores result from weighting the score of each classifier at each level.

Table 3.11. The number of compounds with four predicted levels and at least one category (Superclass or Class levels) that does not need any classifier (only has one child).

Table 3.12. The number of compounds with three predicted levels and one superclass that does not need any classifier (only has one child).

Table 3.13. F1-score results from the top-down prediction approach on the test set and coverage of compounds when the blocking strategy is applied.

Table 3.14. F1-score results from the yeast annotations prediction (top-down approach) and coverage of compounds when the blocking strategy is applied.

Table 3.15. F1-score results from the human fingerprint annotations prediction (top-down approach) and coverage of compounds when the blocking strategy is applied.

Table 6.1. Example of the before and after category filtering at the subclass level. Colour separates groups of compounds with the same leaf category.

Table 6.2. Example of the before and after category filtering at the class level. Colour separates groups of compounds with the same leaf category.

Table 6.3. Example of the before and after category filtering at the superclass level. Colour separates groups of compounds with the same leaf category.

Table 6.4. Hierarchy of “Inorganic compounds” (1592 samples). Highlighted categories only have one child and therefore do not need any classifier.

Tables 6.5-6.14. Hierarchy of “Organic compounds” (288447 samples). Highlighted categories only have one child and therefore do not need any classifier.

Tables 6.15-6.18. Subclass level - cross-validation results from the best combination of parameters of each grid search. The best feature selection/algorithm combination is highlighted.

Tables 6.19 & 6.20. Subclasses without any True Positives on the test set validation.


1. Introduction

1.1. Metabolomics

Metabolomics is the identification and quantification of the complete set of metabolites (the metabolome) in a biological specimen - whole organisms, tissues, cell cultures, etc. [1–3] Metabolites are low molecular weight molecules and intermediates or end products of multiple enzymatic reactions; they are part of cellular metabolism and therefore provide a functional readout of the cell's state. [4,5] Since the beginning of the post-genomic era, metabolomics has been an emerging field as it complements other omics and offers unique advantages: small changes in individual enzymes can cause significant changes in individual metabolite concentrations, making them easier to detect in terms of sensitivity; moreover, the metabolome is closer to function, so it reflects more closely the activities of the cell at a functional level. Because of this, the result of gene expression is expected to be more amplified at the metabolome level than at the transcriptome and proteome levels. [5]

There are two main approaches in metabolomics experiments: untargeted and targeted. On the one hand, untargeted metabolomics is the identification, and possibly relative quantification, of as many chemical species as possible, which requires a comprehensive analysis of the metabolome. This approach is mainly hypothesis-generating and produces large amounts of data. On the other hand, the targeted approach aims at identifying and quantifying a limited number (up to a few hundred) of known metabolites. It is hypothesis-driven by what is being studied, such as clinical biomarkers, pathways of interest, or other key metabolites. [6]

1.1.1. Mass spectrometry in metabolomics

The two main techniques used for metabolomics data acquisition are Nuclear Magnetic Resonance (NMR) and Mass Spectrometry (MS). NMR is a fast and highly reproducible spectroscopic technique, based on the energy difference between the spin states of atomic nuclei placed in a strong magnetic field. MS is a technique that detects the mass-to-charge ratio (m/z) of ions. Before entering the analyzer, the compounds in a biological sample must be ionized, and each metabolite will generate different peak patterns. MS-based metabolomics experiments generally start with a separation step (liquid or gas chromatography), which reduces the high complexity of the analysis. [3,7] Both techniques have their advantages and limitations. NMR is quantitative and requires fewer sample preparation steps than MS, such as separation or derivatization. Nevertheless, MS has greater sensitivity than NMR, giving it more metabolome coverage, which is a significant advantage. MS also offers a broader range of approaches in terms of ionization techniques and mass analyzers, which can be used to increase the number of detectable metabolites. [7]

Fourier-Transform Ion Cyclotron Resonance Mass Spectrometers (FT-ICR-MS) are instruments with very high mass accuracy (usually < 1 ppm) and ultra-high resolution (> 1,000,000), which reduces the need for sample separation techniques before MS analysis and allows the identification of isotopic patterns in low-mass compounds, as is the case for metabolites, making possible the unambiguous assignment of a molecular formula. [8] Figure 1.1 is a mass spectrum from an FT-ICR-MS instrument obtained using electrospray ionization.


Figure 1.1. Mass spectrum from an FT-ICR-MS instrument coupled with electrospray ionization and illustration of the resolution of this instrument.

The FT-ICR-MS analyzer works by trapping ions long enough for their cyclotron frequency to be accurately measured. This analyzer allows tandem MS analysis (MSn), used for structure elucidation. At each tandem MS stage, ions from the injected sample are isolated and fragmented, yielding further chemical species that are analyzed in the next stage. The most common fragmentation methods are collision-induced dissociation and infrared multiphoton dissociation. [8]

1.2. Compound classification

Before MS, the elemental ratios of individual compounds could not be characterized; such characterization was performed on whole sample fractions after separation by compound molecular weight. [9,10]

In 1950, Dirk Willem van Krevelen proposed that the chemical nature of samples, including the presence of structural motifs and chemical properties, could be inferred from the elemental ratios of the sample, which led to the development of what are now known as van Krevelen diagrams. These are a representation of the O/C vs H/C ratios of organic compounds (Figure 1.2) and were first used for studying petroleum and kerogen samples. [9,11,12] Since then, van Krevelen representations have been used for characterizing organic samples in applications beyond the petrochemical field. [12] These include the elucidation of chemical reactions [13–15], the characterization of the main categories of natural organic matter compounds (lipid-like, protein-like, amino-sugar-like, cellulose-like, lignin-like, and condensed hydrocarbons) [14,16,17], and the study of carboxyl-rich alicyclic molecules [18,19] and oxidized black carbon [20].
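As a sketch, the van Krevelen coordinates of a compound can be computed directly from its molecular formula. The minimal regex-based parser below is illustrative (it does not handle parentheses, charges, or isotopes); glucose is used as an example:

```python
# Compute the van Krevelen coordinates (O/C on x, H/C on y) from a formula.
import re

def element_counts(formula):
    # "C6H12O6" -> {"C": 6, "H": 12, "O": 6}; a missing digit means a count of 1
    return {el: int(n or 1) for el, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula)}

counts = element_counts("C6H12O6")           # glucose
oc = counts.get("O", 0) / counts["C"]        # O/C ratio
hc = counts.get("H", 0) / counts["C"]        # H/C ratio
print(oc, hc)                                # 1.0 2.0 - the carbohydrate region
```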

In 2003, van Krevelen diagrams were used to visualize metabolomics MS datasets, and since then, most analyses of complex mixtures using high-resolution MS have included one. [9] This is because data can be annotated with chemical composition and easily transposed to this representation. Kim S, Kramer RW, and Hatcher PG have shown that a van Krevelen plot can be an effective and informative graphical method for displaying complex ultra-high-resolution mass spectrometric data obtained by FT-ICR-MS coupled with electrospray ionization. [14]

Understanding how the most abundant macroelements in organisms (C, H, O, N, P, S, K, and Ca) are distributed among the main categories of organic compounds remains a key challenge, due to the overlap of these features between different categories. One example is the broad use of the N/P ratio as an index of the protein/nucleic acid ratio, since proteins are rich in N and nucleic acids are rich in P; however, these macroelements are also commonly present in other categories of compounds. Another example is illustrated in Figure 1.2 with a van Krevelen diagram, where categories overlap (with limits defined in previous studies), leading to incorrect classifications; on top of that, these categories are not specific enough to build a robust classification system solely on these two stoichiometric ratios. [12]

Figure 1.2. Van Krevelen diagram (H/C vs O/C representation) for the classification of organic compounds. Limits defined in previous studies are represented in ticked lines.[12]

Based on the van Krevelen diagram method that uses O/C and H/C stoichiometric ratios to “classify”

metabolites in biological samples analyzed by FT-ICR-MS coupled with EI, Rivas-Ubach A., et. al reported a classification method (the multidimensional stoichiometric compound classification method, MSCC) [12], which is based on imposing constraints on ten features of elemental formulas composition (O/C, H/C, N/C, P/C, N/P, O, N, P, S, and Mass). They used more than 130 000 compounds that were fitted into six categories: Lipids, Peptides, Amino-Sugars, Carbohydrates, Nucleotides, and Phytochemical compounds (oxy-aromatic compounds). Although this method has shown an improvement in accuracy (over 98%) relative to all classifications based on classical van Krevelen diagrams, these classes can be considered too broad to describe an organism’s metabolome complexity.


1.2.1. ChemOnt, a chemical taxonomy

ClassyFire is an automated tool presented in 2016 by Djoumbou Feunang Y et al., aiming to comprehensively characterize, classify, and annotate chemical structures. Unlike the usual approach of manual, expert-driven classification, it relies on ChemOnt, a structure-based chemical taxonomy with a well-defined hierarchy, a comprehensive dictionary with annotations about each category, and a clear set of classification rules that allow for new entities to be accurately described. This allows for automated rule-based structural classification of essentially all known chemical entities.[21]

ChemOnt has 11 levels in its taxonomy, and the first four are named, in order: Kingdom, Superclass, Class, and Subclass. Any chemical class is considered a “category”. The Kingdom level separates entities into Organic and Inorganic compounds. Organic compounds are considered to be structures with at least one carbon atom. Inorganic compounds are considered to be structures with no carbon atoms, with a few exceptions for some “special” compounds (e.g. cyanide/isocyanide and derivatives, carbon monoxide, carbon dioxide, carbon sulphide, etc.). This classification is easily performed based on a structure’s molecular formula; however, the following levels depend on much more detailed rules. Superclasses (26 organic and 6 inorganic) are generic categories of compounds with general structural identifiers (e.g. organic acids and derivatives, phenylpropanoids and polyketides, organometallic compounds, and homogenous metal compounds). The next level (Class) includes 764 categories, representing more specific and recognizable structural features (e.g. pyrimidine nucleosides, flavanols, benzazepines, actinide salts, etc.). The next level has 1729 Subclasses, and there are an additional 2296 categories in deeper levels (5-11). This makes a total of 4825 chemical categories, including organic (4146) and inorganic (678) compounds, plus the root category (Chemical entities). Figure 1.3 illustrates this taxonomy as a tree structure. The ChemOnt ontology uses category names that are widely used and already exist in the chemical literature. Each category has other popular annotated terms (synonyms) and formal text definitions, which are the basis for the classification rules. [21]

Figure 1.3. Illustration of the ChemOnt taxonomy as a tree. [21]

Since the ChemOnt is based on chemical structure, has a well-defined hierarchy, descriptive categories, and an automated tool to classify compounds, it is a taxonomy suitable for fast large-scale chemical classification tasks.

1.3. Machine learning

For humans, learning is the process of acquiring knowledge or a specific skill in a particular subject. Psychology has various proposed definitions, and most of them state that learning is a change in the behaviour of a subject in a given situation as a result of his or her repeated experiences in that situation. Artificial Intelligence (AI) is a technology that aims to simulate human behaviour in machines, with the goal of making computers solve complex problems. Machine Learning (ML) is a subfield of AI that makes machines learn automatically from data, without being explicitly programmed to, with the goal of predicting new outcomes.[22–24]

The ability of humans to process data is limited, and nowadays, with the increase in dataset sizes, it has been getting more challenging to extract useful information from them. That is where ML is applied, to handle these data more efficiently. There are various ML algorithms; however, there is no “one size fits all” for a specific kind of problem. [23]

“Omics” fields intrinsically deal with big data and have been expanding in the life sciences. Metabolomics, in particular, is envisaged as one of the tools that will most contribute to challenging research objectives (e.g. the personalization of treatments in medical practice). This has brought a growing need to use bioinformatics and computational tools to store and retrieve the produced data and to extract meaningful information to make the most out of it. One of these tools is ML.[3,25]

1.4. Types of machine learning algorithms

ML algorithms can fall into four main types: Supervised, Unsupervised, Semi-Supervised, and Reinforcement (Figure 1.4).

Figure 1.4. Diagram showing the four main types of Machine Learning algorithms, the type of data used by each of them to build the model, and some examples. Adapted from [26].

1.4.1. Supervised learning

Supervised learning algorithms have the task of mapping a given input to the correct output. This is done by inferring a function from labelled training data. The most common tasks performed by supervised methods are classification, which separates the data when the labels represent a discrete variable, and regression, which fits a function to the data when the labels represent a continuous variable.[26]

The most used algorithms for classification are ZeroR, OneR, Naïve Bayes (NB), Decision Tree (DT), K-Nearest-Neighbours (KNN), Support Vector Machines (SVM), and Logistic Regression (LR).

On the other hand, the most used algorithms for regression are Linear Regression and Support Vector Regression. [26–28] Neural networks are the method used in deep learning, which has become the most widely used approach in the ML field. They can be used for other types of ML, but a great focus is given to supervised learning tasks.[29]

Random Forest (RF), Adaptive Boosting (AdaBoost), and eXtreme Gradient Boosting (XGBoost) are also ML algorithms that combine simple models (DTs) to solve a particular task, so they are considered ensemble learning algorithms and, just like the Decision Tree algorithm, they can be used both for classification and regression tasks. [26–28]

ML classification tasks, which are the main scope of this work, are further described in Chapter 1.5.


1.4.2. Unsupervised learning

Unsupervised learning algorithms are applied to unlabelled datasets. The main task of these algorithms is to find patterns, structures, and other types of knowledge in the data, and they are divided into two subgroups: clustering and association. [26–28]

Clustering methods use similarity measures, such as Euclidean and Manhattan distances, and work by grouping observations into clusters. Observations within a cluster will have similarities between them, whereas different clusters will have dissimilar observations. [30] There are six types of clustering methods, and the most used algorithms are shown in brackets. Hierarchical clustering finds successive clusters based on previous ones, following one of two approaches, agglomerative or divisive, neither of which needs an input number of initial clusters (single and complete linkage) [26,27,30,31]. Partitional clustering starts with an input number k of clusters and iteratively relocates samples between them (K-means and K-medoids) [26,27,30]. In Density-based clustering, clusters are identified through the density of data points (DBSCAN and SNN) [26,30]. Grid-based clustering uses a grid structure in the data space and works by computing the grid's point density and forming clusters between adjacent high-density cells (STING and CLIQUE) [26,30]. Model-based clustering assumes that data follows a mathematical probability distribution and tries to fit a finite number of components of these functions (MCLUST) [26,30]. Finally, Constraint-based clustering, in which constraints are provided before the clustering, is not strictly an unsupervised learning method and falls into the semi-supervised learning type of ML (COPK-means and CMWK-means). [26,32]
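As an illustration of the partitional approach, the sketch below implements Lloyd's k-means iteration on one-dimensional data using only the standard library (function names are ours; a real analysis would use an optimized implementation such as scikit-learn's KMeans):

```python
import random

def kmeans_1d(points, k, n_iter=50, seed=0):
    """Minimal Lloyd's algorithm for 1-D data: assign each point to the
    nearest centroid, then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Empty clusters keep their previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
print(kmeans_1d(data, 2))  # two centroids, near 1.0 and 10.0
```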

Association rule learning is a rule-based approach to discovering interesting relationships (“If-Then statements”) between variables in a large dataset. One example of these relationships is “If a customer buys x, he is likely also to buy y”. These associations have a range of application areas, such as medical diagnosis, usage behaviour analytics, web usage, smartphone apps, cybersecurity, and bioinformatics.[26] Many association rule learning methods have been proposed (logic-dependent, frequent pattern-based, and tree-based), and some of the most used algorithms are reviewed in [33].

1.4.3. Semi-supervised learning

Semi-supervised learning is a hybrid of supervised and unsupervised learning methods, as it works with both labelled and unlabelled data [27], as in the constraint-based clustering methods described above. There are two approaches to semi-supervised learning: Self-training and Co-training. [34]

Self-training algorithms are the simplest ones. With this approach, a classifier is built using the labelled data, and it then tries to label the unlabelled data. The tuple with the most confident prediction is added to the labelled data, and this is an iterative process. [34]

On the other hand, with co-training, two or more classifiers teach each other, in which each uses a different and ideally independent set of features from each tuple. After training the classifiers with the labelled data, each one predicts the labels for the unlabelled data, and each classifier teaches the other the prediction made with the best confidence, adding it to the labelled set. This is a less error-prone approach than self-training since the latter tends to reinforce errors.[34]
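The self-training loop above can be sketched minimally as follows (standard library only; the nearest-centroid base classifier and the margin-based confidence are our illustrative choices, not prescribed by the method):

```python
def nearest_centroid_fit(X, y):
    """Fit a 1-D nearest-centroid classifier: one mean per class."""
    labels = sorted(set(y))
    return {lab: sum(x for x, t in zip(X, y) if t == lab) /
                 sum(1 for t in y if t == lab) for lab in labels}

def predict_with_confidence(model, x):
    """Predict the nearest centroid; confidence = margin to the runner-up."""
    dists = sorted((abs(x - c), lab) for lab, c in model.items())
    return dists[0][1], dists[1][0] - dists[0][0]

def self_train(X_lab, y_lab, X_unlab, n_rounds=10):
    """Each round: fit on labelled data, pseudo-label the single most
    confident unlabelled point, and move it into the labelled set."""
    X_lab, y_lab, X_unlab = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(min(n_rounds, len(X_unlab))):
        model = nearest_centroid_fit(X_lab, y_lab)
        preds = [predict_with_confidence(model, x) for x in X_unlab]
        best = max(range(len(X_unlab)), key=lambda i: preds[i][1])
        X_lab.append(X_unlab.pop(best))
        y_lab.append(preds[best][0])
    return nearest_centroid_fit(X_lab, y_lab)

model = self_train([0.0, 10.0], ["a", "b"], [1.0, 2.0, 9.0, 8.5])
```

Note how pseudo-labelling the most confident point first is exactly what makes self-training prone to reinforcing early mistakes.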

1.4.4. Reinforcement learning

This is an interactive type of ML method, allowing an agent to learn by trial and error. Reinforcement learning (RL) methods allow the agent to interact with the environment and learn from its actions and experiences. RL problems are defined as Markov Decision Processes and typically include the Agent, the Environment, Rewards, and a Policy.[26,35,36]


There are two main types of RL methods: Model-based and Model-free. The difference is that in model-based algorithms, behaviour is optimized using a model of the environment, from which the next state is predicted using the action as an input, whereas in model-free algorithms, the state transition probability and the reward function are not used to make predictions beforehand.[26,36] Monte Carlo methods, Q-learning, and Deep Q-learning are famous RL algorithms. [26]

1.5. Machine learning classification task

As described in Chapter 1.4.1, classification is a supervised learning method and a form of data analysis that creates models to describe data classes. These models, the classifiers, predict categorical (discrete or unordered) class labels. The algorithm mathematically maps a function from the input variables (X), named features, to the output variables (Y), the class labels. [26,28,34]

One example of a classification task would be if a medical researcher with cancer data wanted to predict which of 3 treatments a patient should receive or if, based on that data, a specific patient would benefit from a given treatment.[34] Both examples involve building a classifier to predict one of the three or two class labels, respectively. The first example is a multiclass task, and the second is a binary task, which are common classification problems.

There are three main types of classification problems:

Binary classification is a classification task involving only two class labels, such as “true” and “false”, or “yes” and “no”, like in the last example above. Commonly, binary classification tasks have one normal state label and an abnormal state label. This could be the result of a medical test or the classification of an e-mail as “spam” or “not spam”, for example. [26]

Multiclass classification has more than two class labels and does not involve a typical normal or abnormal outcome like in binary classification. Instead, observations are classified as belonging to one of many classes. [26]

Multi-label classification is a generalization of multiclass classification and arises when a single observation can be associated with more than one label or class. [26]

1.5.1. Decision Tree

A DT is a prevalent non-parametric supervised learning method for classification and regression tasks. [26,34] This method involves building a flow chart, which resembles a tree structure and is represented in Figure 1.5.

For the task of classification, given a tuple X of features with an unknown label, this tuple is tested at each internal node, tracing a path until reaching the matching leaf node that holds the corresponding label prediction. Since a DT does not make any assumptions about the data, it is appropriate for exploratory knowledge discovery and does not require prior domain knowledge. This method can handle multidimensional data, has an intuitive and easily understandable representation, and has good performance. [34]


Algorithms for DT building have been evolving throughout the years. These algorithms use feature selection measures to select the feature that contributes to the best separation of the data at each internal node.[34] During the late ’70s and early ’80s, John Ross Quinlan developed Iterative Dichotomiser (ID3)[37], a DT algorithm which was later improved and succeeded by C4.5 [38] during the early ’90s, and by See5/C5.0 [39] in 1997, all by the same author. Classification and Regression Trees (CART), developed by L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone around the same time as ID3, was published in 1984 [40], and both triggered DTs’ popularity [34].

CART has a similar approach to ID3 (or its more recent version, C5.0). The main differences are that it supports numerical target variables (regression tasks) and only supports binary trees, so each node has at most two children (or outcomes) [41]. scikit-learn, one of the most popular ML packages for Python, uses the CART algorithm for its DT implementation [41].
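A quick sketch of that CART-based implementation (assuming scikit-learn is installed; the toy data is ours):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy dataset: two features, two classes separable on the first feature.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [5, 0], [6, 1], [5, 1], [6, 0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# `criterion` selects the splitting measure: "gini" (default) or "entropy".
clf = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
clf.fit(X, y)
print(clf.predict([[0.5, 0.5], [5.5, 0.5]]))  # → [0 1]
```

Here a single binary split on the first feature is enough to separate the classes, which is exactly the locally optimal split CART finds.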

CART and John Ross Quinlan’s algorithms have a greedy approach, in which the DT is constructed from top to bottom in a divide-and-conquer manner, just like most DT induction algorithms.[34] The greedy approach means that each split is locally optimized at the node, maximizing data separation on one feature based on one of two criteria (described below): Entropy (Equation 1.1) or Gini impurity (Equation 1.4), where $p_i$ is the probability of randomly choosing one observation of a class ($i$) from a set of data ($D$), estimated as the ratio between the number of observations of that class in the data and the total number of observations in the data, $|C_{i,D}|/|D|$ [26,34,41].

Splitting criteria

Entropy

The entropy of a node is a measure of its randomness. It quantifies the information needed to classify that node's observations. The entropy of a node is computed with Equation 1.1, and the “entropy” of a split A (cross-entropy or log-loss) is computed with Equation 1.2, which adds the entropy of each of the resulting nodes ($j$) with the term $|D_j|/|D|$ as a weight factor for each partition. The information gain of a putative split A describes the expected reduction in the information required to classify D caused by knowing the value of the feature on which the split is occurring. This measure is the difference between the initial entropy and the split entropy (Equation 1.3), and it is used to decide the best possible split.[34,41]

Figure 1.5. A) Structure of a Decision Tree: the topmost node is the Root node; Decision (or Internal) nodes are the intermediary ones, where each feature is tested; Leaf nodes hold the outcome label; and the Branches represent the outcome of the test made at their node of origin.[26] B) Example of a classification tree, where all the features are categorical variables. Root and Decision nodes have a rectangular shape, and Leaf nodes have an oval shape.[34]


$$\mathrm{Entropy}(D) = -\sum_{i=1}^{n} p_i \log_2(p_i) \qquad \text{(Equation 1.1)}$$

$$LL_A(D) = \sum_{j=1}^{m} \frac{|D_j|}{|D|} \times \mathrm{Entropy}(D_j) \qquad \text{(Equation 1.2)}$$

$$\mathrm{Gain}(A) = \mathrm{Entropy}(D) - LL_A(D) \qquad \text{(Equation 1.3)}$$

For algorithms that allow a split into multiple child nodes (non-binary trees), normalization can be applied due to the possibility of a split resulting in leaf nodes containing a single observation, since a split like this would have the minimum entropy possible, 0. This method is further described in [34].

Gini impurity

Gini impurity (also known as Gini index), unlike entropy, considers a binary split at each node, and it represents the probability of each sample being mislabelled (Equation 1.4) [26,34]. For a putative split, the resulting Gini impurity is given by Equation 1.5, with the same weighting factors for each partition as before, $|D_1|/|D|$ and $|D_2|/|D|$, which are the fractions between the number of observations in each partition and the total number of observations. Equation 1.6 describes the reduction in impurity by split A. The feature and split that maximize the reduction in impurity (or, equivalently, have the minimum Gini index) are selected. The selected feature, together with the split set (for discrete features) or the split point (for continuous features), forms the splitting criterion. [34]

$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{n} p_i^2 \qquad \text{(Equation 1.4)}$$

$$\mathrm{Gini}_A(D) = \frac{|D_1|}{|D|} \times \mathrm{Gini}(D_1) + \frac{|D_2|}{|D|} \times \mathrm{Gini}(D_2) \qquad \text{(Equation 1.5)}$$

$$\Delta \mathrm{Gini}(A) = \mathrm{Gini}(D) - \mathrm{Gini}_A(D) \qquad \text{(Equation 1.6)}$$

For both the Entropy and Gini impurity measures, discrete feature splits are tested for each possible combination of values, and continuous feature splits are tested at every possible midpoint between each pair of sorted adjacent values[34]. These metrics are used for classification tasks; for regression, other metrics are applied: Mean Squared Error, Half Poisson deviance, and Mean Absolute Error are the ones used by the scikit-learn implementation [41].
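Both criteria are short to compute directly. A standard-library sketch (function names are ours) mirroring Equations 1.1-1.5 and the information gain of Equation 1.3:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a set of class labels (Equation 1.1)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a set of class labels (Equation 1.4)."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_score(partitions, measure):
    """Weighted impurity of a split (Equations 1.2 and 1.5)."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * measure(p) for p in partitions)

D = ["yes"] * 9 + ["no"] * 5          # 9 positives, 5 negatives
left, right = D[:7], D[7:]            # a putative binary split
gain = entropy(D) - split_score([left, right], entropy)  # Equation 1.3
```

For this 9/5 split the initial entropy is about 0.94 bits, and the putative split (a pure left node of 7 "yes") yields a positive information gain.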

Tree pruning

After building a DT, some branches may reflect outliers or noise in the training data. This is associated with overfitting of the data, in which the classifier becomes overly tailored to the data it has seen during training. To solve this problem, tree pruning methods were developed, which aim to remove less-reliable branches to improve the generalization to unseen data (Figure 1.6). The pre-pruning approach halts DT construction earlier, before every leaf node is pure, so that each leaf holds the most frequent label on its subset.[34] This can be done by setting a specific threshold for the splitting criteria, a maximum number of levels that a tree can have, the minimum number of observations that a node should have, the minimum number of observations for splitting a node, the maximum number of leaf nodes in a tree, among others. All of these are parameters in the DT classifier implementation in scikit-learn. [41] On the other hand, post-pruning removes branches after the tree is “fully grown”: a subtree is removed and replaced by a leaf node holding the most frequent label on that subset. Cost-complexity is a post-pruning algorithm used in the CART DT induction algorithm, particularly in the scikit-learn implementation. [34,41] Cost complexity is a function of the number of leaves in the tree and its error rate. The algorithm prunes starting from the bottom of the tree until the set threshold is reached.[34]

Figure 1.6. Example of tree pruning. Left: unpruned Decision Tree; Right: pruned Decision Tree. Adapted from [34].
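In scikit-learn, cost-complexity post-pruning is exposed through the `ccp_alpha` parameter (sketch below, assuming scikit-learn is installed; the alpha value is arbitrary and would normally be chosen by cross-validation):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unpruned tree, then the effective alphas along its pruning path.
full = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full.cost_complexity_pruning_path(X, y)

# Refit with a small ccp_alpha: branches whose cost-complexity improvement
# falls below alpha are pruned away, yielding a smaller tree.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)
print(full.get_n_leaves(), pruned.get_n_leaves())
```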

1.5.2. Random Forest

As described in Chapter 1.4.1, RF is an ensemble method that combines a series of base classifiers to create an improved one. Each base classifier votes on the prediction it makes, and the ensemble considers the votes of all base classifiers to make its own prediction. Specifically, RF takes DTs as base classifiers and uses bagging (bootstrap aggregation) for training, together with a random selection of the features used to train each tree. Bagging is a method in which each classifier is trained with a set of observations Di of size d. Di is the result of random sampling with replacement of the original dataset, and d is the total number of observations in the original dataset. This implies that each training observation may be repeated, or it may not be used to train that tree at all. The latter are the out-of-bag samples. [34]

RFs are harder to interpret than DTs; however, since they are created using bagging, the uncorrelated trees operating together can outperform individual trees, forming a strong classifier from otherwise weak ones. Tree pruning in RF has been shown to have minimal effect on decreasing variance (overfitting), and RFs are robust even when only a small fraction of the variables is relevant. Since only a fraction of all features is used to build each tree, RF can also deal with high-dimensional data. Each RF tree is independent, making the training process parallelizable, and the method supports class weight attribution, which can be helpful for unbalanced datasets.[41,42] Unlike other algorithms that rely on distances between the data, DT and RF do not need any feature scaling, since they are rule-based.[28] Otherwise, this would imply an additional pre-processing step. Distance-based algorithms are biased because they give more importance to features with larger numerical values. All these advantages make RF a widely used algorithm in ML.
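Several of these points (bagging with out-of-bag samples, class weights, and MDI feature importances) appear directly in scikit-learn's interface; a minimal sketch, assuming scikit-learn is installed:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 bagged trees, each seeing a bootstrap sample and a random feature
# subset at each split; class_weight can counter unbalanced datasets, and
# oob_score evaluates on each tree's out-of-bag samples for free.
rf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                            oob_score=True, random_state=0)
rf.fit(X, y)

# Mean decrease in Gini impurity (MDI) per feature; sums to 1.
print(rf.oob_score_, rf.feature_importances_)
```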

1.5.3. Boosting

Boosting is an ensemble method that transforms several weak learners into strong ones. With boosting, a series of classifiers (usually shallow DTs) are iteratively trained based on the errors of the previous classifiers. AdaBoost is a popular boosting algorithm in which the weights of each sample are updated so that the following classifier can “pay more attention” to the samples misclassified by the previous classifier. The final boosted classifier combines the votes of each tree weighted by its accuracy.[34] Gradient boosting is a method derived from boosting that consists of training the new classifiers to optimize a differentiable loss function.[43] XGBoost is an example of a gradient-boosting algorithm.

While boosting can sometimes achieve greater accuracy, these algorithms are more computationally expensive and more prone to overfitting than RF, needing a more thorough and extensive tuning process.
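A minimal AdaBoost sketch, assuming scikit-learn is installed (its default base learner is a depth-1 tree, a "stump"):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

X, y = load_iris(return_X_y=True)

# 50 stumps boosted sequentially; each round up-weights the samples the
# previous round misclassified, and votes are weighted by accuracy.
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(ada.score(X, y))
```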

1.5.4. Naïve Bayes

NB is a probability-based algorithm (Bayes theorem) which assumes that the effect of an attribute (or feature) on a given class is independent of the values of the other features. This is known as class-conditional independence, an assumption made to simplify the computations involved, hence the name “naïve”. [26,34,44]

According to Bayes’ theorem, let $X$ be a data tuple, which can be considered our evidence, and let $C_i$ be the hypothesis that the data tuple $X$ belongs to a specified class $C_i$. For classification tasks, the value of $P(C_i|X)$ is what we want to determine; it represents the probability that $X$ belongs to $C_i$ given that $X$ is known. This is the posterior probability, which contrasts with $P(C_i)$, the prior probability, which in this case would be the probability of any $X$ belonging to $C_i$. Bayes’ theorem provides a way of calculating the posterior probability of $C_i$, $P(C_i|X)$, from values that can be obtained from the data (Equation 1.7).[34]

$$P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)} \qquad \text{(Equation 1.7)}$$

The Naïve Bayes classifier works by maximizing $P(C_i|X)$, knowing that $P(X)$ is constant for all classes and that $P(C_i)$ is either constant (if all classes are equally likely) or the fraction of training samples in each class. $P(X|C_i)$ is computed assuming independence of all $n$ features (otherwise, it would be too computationally expensive to compute) (Equation 1.8). $P(x_k|C_i)$ is estimated from the training samples. If $x_k$ is categorical, then $P(x_k|C_i)$ is the fraction of training tuples labelled with class $C_i$ that have the value $x_k$. If $x_k$ is continuous, features are assumed to have a Gaussian distribution, and $P(x_k|C_i)$ is computed from the mean and standard deviation of that feature in the class.[34]

$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) \qquad \text{(Equation 1.8)}$$

This algorithm can be used, and usually works well, for binary and multiclass tasks. Compared to other algorithms, it has the advantages of being fast, needing only small amounts of training data, handling continuous and discrete features, and making probabilistic predictions. On the other hand, NB performance may be affected because the feature independence assumption is very strong, which is usually not the case in real-world datasets, and because a Gaussian distribution is assumed for continuous features. [26,28,34,44] This algorithm is commonly used for text document classification tasks, and it has several popular implemented variants: Gaussian, Multinomial, Complement, Bernoulli, and Categorical, which are described in [41].
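The categorical case of Equations 1.7-1.8 can be sketched in a few lines of plain Python (the toy weather data and function names are ours; Laplace smoothing, used by real implementations, is omitted for brevity):

```python
from collections import Counter, defaultdict

def train_nb(X, y):
    """Estimate P(Ci) and the per-feature counts behind P(xk|Ci)."""
    prior = {c: n / len(y) for c, n in Counter(y).items()}
    cond = defaultdict(lambda: defaultdict(Counter))
    for row, c in zip(X, y):
        for k, val in enumerate(row):
            cond[c][k][val] += 1
    return prior, cond, Counter(y)

def predict_nb(model, row):
    """Pick the class maximizing P(Ci) * prod_k P(xk|Ci) (Eqs. 1.7-1.8)."""
    prior, cond, counts = model
    scores = {}
    for c in prior:
        p = prior[c]
        for k, val in enumerate(row):
            p *= cond[c][k][val] / counts[c]   # fraction of Ci tuples with xk
        scores[c] = p
    return max(scores, key=scores.get)

X = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
y = ["no", "no", "yes", "yes"]
model = train_nb(X, y)
print(predict_nb(model, ("rain", "mild")))  # → yes
```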


1.5.5. K-nearest-neighbours

The K-nearest-neighbours (KNN) algorithm assumes that similar observations exist in proximity. KNN is a lazy learner because it just stores the training data until a test set is provided; only then is generalization performed, to classify the data based on similarity (or proximity). KNN calculates the distance (using similarity measures) between one observation and all the data points in the dataset, then chooses the K closest points. The most frequent label among the K neighbours will be the winning class. Despite being based on distances, categorical features may be used as well: generally, a distance of 0 is considered when the value is the same, and a distance of 1 when the value is different. KNN is easy to use and has a quick calculation time when the dataset is not too large. Despite this, it greatly depends on the quality of the data; there is a certain difficulty in finding the optimal value of K; and it has difficulties classifying boundary points. [28,34,44]
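The whole procedure fits in a few lines; a standard-library sketch (function names and toy data are ours) using Euclidean distance and a majority vote:

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points
    (squared Euclidean distance; the square root is order-preserving)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, x)), label)
        for p, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train_X = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
train_y = ["low", "low", "low", "high", "high", "high"]
print(knn_predict(train_X, train_y, (0.5, 0.5), k=3))  # → low
```

Note that nothing is "learned" until `knn_predict` is called, which is exactly the lazy-learner behaviour described above.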

1.5.6. Logistic Regression

Like the NB algorithm, LR is probability-based and is used in classification tasks. Natively, LR is used with binary problems, and the probability of the positive class is given by a logistic (sigmoid) function (Equation 1.9), with $X_i$ as the tuple of features of a sample. The algorithm deals with an optimization problem of minimizing the cost function in Equation 1.10, in which C is a regularization parameter and $r(w)$ is a regularization term. In scikit-learn, $r(w)$ can be defined as: None, l1, l2, or elasticnet (a mix of l1 and l2). For the multinomial case, these equations change to accommodate the fact that there are $k$ classes (>2), and the $w$ parameter becomes a matrix where each row corresponds to a class.[26,45]

$$\hat{p}(X_i) = \frac{1}{1 + e^{-(X_i \cdot w + w_0)}} \qquad \text{(Equation 1.9)}$$

$$\min_{w} \; C \sum_{i=1}^{n} \left(-y_i \log(\hat{p}(X_i)) - (1 - y_i)\log(1 - \hat{p}(X_i))\right) + r(w) \qquad \text{(Equation 1.10)}$$

LR performs better when the data can be separated linearly, and it usually overfits with high-dimensional data; in the latter case, regularization techniques can be used to prevent overfitting. One of the major disadvantages of LR is that it assumes linearity between the independent and dependent variables. Despite being more frequently used for classification, LR can solve both classification and regression tasks. [26]
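A minimal sketch of the scikit-learn interface for Equations 1.9-1.10 (assuming scikit-learn is installed; iris is a stand-in multinomial dataset):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# C is the inverse regularization strength; penalty selects r(w) (l2 default).
lr = LogisticRegression(C=1.0, penalty="l2", max_iter=1000).fit(X, y)

# In the multinomial case, w becomes a matrix with one row per class.
print(lr.score(X, y), lr.coef_.shape)
```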

1.5.7. Support Vector Machines

SVM is an algorithm used for classification and regression with linear and nonlinear data. When the data is nonlinear, the algorithm uses a kernel that projects the data into a (usually) higher-dimensional space where linear separation of classes is possible. It then searches for a linear optimal decision boundary (or hyperplane), chosen to maximize the distance from the support vectors (the margin of each class) to the hyperplane. SVM typically has a low generalization error, which provides robustness to the model. Another advantage is the kernel “trick”, which gives flexibility to the algorithm.

On the other hand, SVM is not easily interpretable in the sense that the contribution of each feature is usually unknown. The choice of the kernel is also a drawback, as well as other parameters that critically influence the results and must be correctly set. SVM is also very sensitive to irrelevant features, which can become misleading, resulting in wrong classifications.[28,34,46]
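The kernel "trick" is easy to demonstrate on an XOR-like dataset, which no linear boundary can separate (sketch assuming scikit-learn is installed; the gamma value is an illustrative choice):

```python
from sklearn.svm import SVC

# XOR-like data: the classes sit on opposite diagonals.
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 0, 1, 1]

# A linear kernel cannot separate XOR; the RBF kernel implicitly maps
# the points into a space where a separating hyperplane exists.
linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y)
print(linear.score(X, y), rbf.score(X, y))
```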

1.5.8. Dimensionality reduction and Feature learning

In ML, data can often reach high dimensionality, which can be challenging to deal with. Therefore, methods for dimensionality reduction can be of great importance, since they improve human interpretation, reduce computational costs, and avoid overfitting of the data when there is redundancy between features, simplifying the ML models. To tackle this, one of two main approaches can be used: feature selection or feature extraction.[26]

With feature selection, a subset of the original features is maintained, decreasing model complexity through the elimination of irrelevant or less important features. This can decrease training time, minimize overfitting through better generalization of the model, and increase the model’s accuracy. On the other hand, feature extraction reduces the number of features by generating new ones and discarding the originals. This usually provides a better understanding of the data, improves accuracy, and reduces computational costs and training time. [26]

Many methods have been proposed for this end, and some proven approaches for feature selection are: setting a variance threshold on each feature and removing features with low variance; computing the pairwise correlation between features and removing one of the features in a pair if they are found to be highly correlated; and performing univariate statistical tests for feature scoring (e.g., ANOVA and chi-square).[26] Model-based selection methods are often based on output feature coefficients (in linear models) or feature importances that can be used to select the most contributing features. A good example of the latter is in RF, where importance comes from the decrease in the Gini impurity attributable to each feature within each tree (mean decrease in Gini impurity, MDI). Another popular model-based selection method is permutation feature importance, in which each feature is permuted k times, with prediction performed after each permutation. The resulting score is compared with a baseline score to compute each feature's mean decrease in accuracy, from which its contribution can be extrapolated. This is a less biased but significantly more computationally intensive approach.[41] For feature extraction, one popular method is Principal Component Analysis, which transforms a set of correlated variables into a set of uncorrelated variables, the principal components. [26]
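Two of these approaches, variance-threshold selection and PCA extraction, can be sketched side by side (assuming scikit-learn is installed; the synthetic data, with one constant and one duplicated feature, is ours):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = np.hstack([X, np.ones((100, 1))])   # 4th feature is constant (zero variance)
X = np.hstack([X, X[:, [0]] * 2.0])     # 5th feature duplicates feature 0

# Selection: drop (near-)zero-variance features; the constant column goes.
selected = VarianceThreshold(threshold=1e-8).fit_transform(X)

# Extraction: project onto uncorrelated principal components.
reduced = PCA(n_components=2).fit_transform(X)
print(selected.shape, reduced.shape)
```

Note that the variance threshold removes the constant column but keeps the redundant duplicate; catching the latter would require the correlation-based filter described above.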

1.5.9. Classifier Evaluation, Validation, and Selection

After building a classifier, which involves data gathering, pre-processing, and training of the model, there is the need to evaluate the model's accuracy in predicting the class labels it was trained to predict. For this purpose, various tools and methods have been developed.

Evaluation metrics

Various measures can be used to assess how good the predictions of our classifier are, summarized in Table 1.1: Accuracy, Error rate, Sensitivity, Specificity, Precision, and F-score. For these measures, Positives (P) stand for the observations with the class of interest, and Negatives (N) stand for the other observations. In each observation, each true label is compared to the predicted label, which can result in one of the following: True Positive (TP), True Negative (TN), False Positive (FP), or False Negative (FN), which are used to compute these measures and can be summarized in a confusion matrix.[34]


Table 1.1. Evaluation metrics used for machine learning classifiers. Some are known by more than one name. TP, TN, FP, FN, P, and N stand for True positives, True negatives, False positives, False negatives, Positives, and Negatives. Adapted from [34].

Measure | Formula
Accuracy, recognition rate | (TP + TN) / (P + N)
Error rate, misclassification rate | (FP + FN) / (P + N)
Sensitivity, true positive rate, recall | TP / P
Specificity, true negative rate | TN / N
Precision | TP / (TP + FP)
F, F1, F-score (harmonic mean of precision and recall) | (2 × precision × recall) / (precision + recall)

Datasets can often be imbalanced, which is usually the case in medical problems, where the disease is the class of interest and is usually much smaller than the non-disease class. If we used accuracy, the TP would be masked by the TN. The same would happen with the error rate, making these measures unsuitable for imbalanced datasets.

Sensitivity and Specificity measure how well the positive class is being classified and how well the negative class is being classified, respectively, therefore, they can be used with imbalanced datasets.

Precision and recall are also widely used in classification, especially with imbalanced datasets. Precision is a measure of exactness (how many of the tuples predicted as positive are actually positive), and recall is a measure of completeness (how many of the actual positive tuples are predicted as positive), which is, in fact, the same as sensitivity.[34]

Sensitivity and Specificity can be combined into Accuracy by multiplying each one by the fraction of positives and negatives in the total, respectively, and adding the results. On the other hand, precision and recall can also be combined into a single measure, the F-measure, which gives the same weight to both, making it a robust measure for imbalanced datasets.[34]
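As a check on the formulas in Table 1.1, precision, recall, and the F-measure can be computed with scikit-learn on the same hypothetical labels (4 positives, 6 negatives; TP = 3, FP = 2, FN = 1):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical true and predicted labels (1 = class of interest)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

prec = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/5
rec = recall_score(y_true, y_pred)      # TP / P = 3/4 (same as sensitivity)
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall

print(prec, rec, f1)
```

Here the F-measure (2 × 0.6 × 0.75 / 1.35 = 2/3) sits between precision and recall, penalizing the imbalance between the two.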

These measures can all be calculated for binary classification problems; however, only accuracy and error rate can be computed "as is" when there are more than two classes (multiclass problems). To overcome this and extend binary metrics to multiclass problems, the data is treated as a collection of binary problems, one for each class, and the metric values calculated for each class are averaged. While the "macro" average gives equal weight to each class, the "weighted" average accounts for class imbalance by weighting each class by its size in the true data sample, and "micro" averaging gives each sample an equal contribution.[41]
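The three averaging strategies correspond to the `average` parameter of the scikit-learn metric functions; the three-class labels below are a hypothetical toy example constructed so that the averages differ visibly:

```python
from sklearn.metrics import f1_score

# Hypothetical three-class example (class supports: 4, 2, 4)
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 1, 2, 2, 2, 0]

macro = f1_score(y_true, y_pred, average="macro")        # unweighted mean over classes
weighted = f1_score(y_true, y_pred, average="weighted")  # mean weighted by class support
micro = f1_score(y_true, y_pred, average="micro")        # from global TP/FP/FN counts

print(macro, weighted, micro)
```

Note that micro-averaged F1 equals plain accuracy when every sample carries exactly one label (here 7 of 10 samples are correct), while the macro average treats the small middle class as equal to the two larger ones.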

Balanced accuracy is another measure commonly used with imbalanced datasets. It is the macro average of the recall score or, equivalently, the accuracy where each sample is weighted by the inverse prevalence of its true class.[41]
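The equivalence between balanced accuracy and the macro average of per-class recall can be illustrated on a hypothetical strongly imbalanced example (2 positives, 8 negatives):

```python
from sklearn.metrics import balanced_accuracy_score, recall_score

# Hypothetical imbalanced labels: plain accuracy would be 9/10 = 0.9
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

rec_pos = recall_score(y_true, y_pred, pos_label=1)  # sensitivity: 1/2
rec_neg = recall_score(y_true, y_pred, pos_label=0)  # specificity: 8/8

# Balanced accuracy = macro average of the per-class recalls
bal = balanced_accuracy_score(y_true, y_pred)
print(bal)  # → 0.75
```

Plain accuracy (0.9) hides the fact that half of the positive class is misclassified, whereas balanced accuracy (0.75) exposes it.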
