Machine learning approaches for predicting effects of drug combinations in cancer

Universidade do Minho

Escola de Engenharia

Delora Soeiro Baptista

Master's Dissertation

Departamento de Informática

Master in Bioinformatics

Supervisors

Miguel Francisco Almeida Pereira Rocha, PhD

Carlos Daniel Moutinho Machado, PhD

Acknowledgements

First and foremost, I wish to express my gratitude to my supervisors, Professor Miguel Rocha and Dr. Daniel Machado. I thank them for offering me this opportunity and for their insight, guidance, and patience.

I thank the AstraZeneca-Sanger Drug Combination Prediction Challenge organizers for providing the valuable resources for this work. I would also like to thank the DREAM community for the helpful discussions on the challenge forum.

Finally, a special thank you to my family and close friends for their encouragement and patience.

Abstract

Drug combination therapies are commonly used to overcome tumor drug resistance. Computational methods can be helpful tools in drug combination discovery, but there are currently no established methods for the prediction of drug combination effects.

This work, integrated in the AstraZeneca-Sanger Drug Combination Prediction challenge launched by the Dialogue for Reverse Engineering Assessments and Methods (DREAM) community, aimed to develop machine learning methods to estimate the effects of drug combinations on cancer cell lines. The challenge was divided into three subchallenges (1A, 1B, and 2) addressing different clinical scenarios.

A variety of machine learning models were developed and evaluated using cross-validation. Tree-based ensembles, particularly gradient boosting, performed best for this problem. Among the different datasets provided, the monotherapy, mutation, and copy number variation (CNV) datasets were the most informative and were the only ones used in the final models.

The best model, submitted to 1A, was an ensemble of gradient boosting (GB), random forest (RF), and partial least squares (PLS) regression models, having achieved an average weighted Pearson correlation of 0.30, and ranking 24th among 76 submissions. The 1B model (average weighted Pearson correlation of 0.18; 47th/62 submissions) was also an ensemble of GB, RF, and PLS models. For subchallenge 2, a GB model was selected. It had a performance score (based on a three-way analysis of variance (ANOVA)) of 5.15 and ranked 20th out of 39 submissions.

The strategies explored in this work and by the DREAM challenge community will help to further the development of computational methods for the rational design of effective drug combinations for cancer therapy.

Summary

The use of multiple drugs in combination is a common strategy for overcoming drug resistance in tumors. Computational methods can be valuable tools in the discovery of new combinations of interest, but there is currently no established method for this purpose.

This work, integrated in the AstraZeneca-Sanger Drug Combination Prediction challenge proposed by the DREAM community, aimed to develop machine learning methods to predict the effects of drug combinations on tumor cell lines. The problem was divided into three challenges (1A, 1B, and 2) that addressed distinct clinical scenarios.

Several models were developed and evaluated through cross-validation. Ensembles of models based on decision trees achieved the best performance. Of all the datasets, the monotherapy, mutation, and copy number variation data were the most informative and were the ones used by the final models.

The estimator used for task 1A (weighted average Pearson correlation of 0.30; 24th of 76 submissions) was an ensemble composed of gradient boosting (GB), random forest (RF), and partial least squares (PLS) regression models. For problem 1B, another ensemble of GB, RF, and PLS models was used (0.18; 47th of 62 submissions). For question 2, a GB model was developed that achieved a performance score (calculated from the results of an ANOVA) of 5.15, making it the 20th best model out of a total of 39.

The strategies explored in this work and by the other teams that took part in this DREAM community challenge are a useful contribution to the future development of computational methods for the rational design of effective drug combinations for the treatment of tumors.

Abbreviations

ANOVA Analysis of variance

CART Classification and regression trees

CCLE Cancer Cell Line Encyclopedia

CNV Copy number variation

COSMIC Catalogue of Somatic Mutations in Cancer

CSV Comma-separated values

DIGRE Drug-induced genomic residual effect

DREAM Dialogue for Reverse Engineering Assessments and Methods

FATHMM Functional analysis through hidden Markov models

GB Gradient boosting

GDSC Genomics of Drug Sensitivity in Cancer

HGVS Human Genome Variation Society

MSigDB Molecular Signatures Database

NCG Network of Cancer Genes

PLS Partial least squares regression

QA Quality assessment

RF Random forest

RMA Robust multi-array average

SMILES Simplified molecular input line entry specification

SNP Single nucleotide polymorphism

SVR Support vector machine regression

List of Figures

Figure 2.1 – Mechanisms of drug resistance

Figure 3.1 – The three subchallenges of the AstraZeneca-Sanger DREAM challenge.

Figure 3.2 – General flowchart for subchallenge 1A.

Figure 3.3 – General flowchart for subchallenge 1B.

Figure 3.4 – The output that is generated for subchallenge 2

Figure 3.5 – The official timelines for subchallenges 1A, 1B and 2

Figure 3.6 – Pipeline developed in this work.

Figure 4.1 – Score distribution for the subchallenge 1A final submission round.

Figure 4.2 – Score distribution for the subchallenge 1B final submission round.

List of Tables

Table 3.1 – Datasets used in this work.

Table 3.2 – Summary of the algorithms that were used

Table 4.1 – A selection of the best cross-validation results from round 4

Table 4.2 – The effect of using the MSigDB gene list.

Table 4.3 – The effect of adding PolyPhen mutation effect predictions.

Table 4.4 – The effect of using the NCG gene list.

Table 4.5 – The effect of scaling.

Table 4.6 – Gradient boosting models generated before the final submission

Table 4.7 – “Wisdom of crowds” models generated before the final submission

Table 4.8 – Leaderboard and final submission results for subchallenge 1A.

Table 4.9 – Leaderboard and final submission results for subchallenge 1B.

Table 4.10 – Leaderboard and final submission results for subchallenge 2

Table 4.11 – Input features ranked by their importance for the subchallenge 1A model.

Table 4.12 – Input features ranked by their importance for the subchallenge 1B model.

Table 4.13 – Input features ranked by their importance for the subchallenge 2 model.

Table A.1 – Cross-validation performance scores for subchallenge 1A GB models.

Table A.2 – Cross-validation performance scores for subchallenge 1B GB models.

Table A.3 – Cross-validation performance scores for subchallenge 1A RF models.

Table A.4 – Cross-validation performance scores for subchallenge 1B RF models.

Table A.5 – Cross-validation performance scores for subchallenge 1A PLS models.

Table A.6 – Cross-validation performance scores for subchallenge 1B PLS models.

Table A.7 – Cross-validation performance scores for subchallenge 1A “Wisdom of crowds” models.

Table A.8 – Cross-validation performance scores for subchallenge 1B “Wisdom of crowds” models.

Table A.9 – Cross-validation performance scores for subchallenge 1A SVR models.

Table A.10 – Cross-validation performance scores for subchallenge 1A Stacking models.

List of Equations

Equation 3.1 – Subchallenge 1 primary scoring metric

Equation 3.2 – Subchallenge 2 primary scoring metric

Equation 3.3 – Three-way ANOVA model used by the subchallenge 2 primary scoring metric

Chapter 1 Introduction

1.1 Context

The efficacy of single-agent cancer therapies is often diminished by the existence of tumor drug resistance mechanisms. Resistance-conferring characteristics may be innate or they may arise in response to the treatment itself. In addition, due to the heterogeneity of tumors, drug-resistant cells may be positively selected during treatment (Holohan et al. 2013).

Drug combination therapies have been used as an attempt to overcome drug resistance. When drugs are jointly administered, the resulting effect can be classified as additive, if it is the expected effect; synergistic, if the response is enhanced beyond expected; or antagonistic, if the response is reduced. Synergy is a highly desirable outcome, as it increases treatment efficacy without requiring an increase in drug dosage, potentially avoiding an increase in toxicity as well (Tallarida 2011).

Novel drug combinations are discovered experimentally through high-throughput screening, in which numerous compound pairs are tested on a panel of cell lines at varying concentrations. Given the vast number of drug combinations that are possible, it would be unfeasible to test all conceivable pairs, for both practical and financial reasons (Bansal et al. 2014). Computational methods that can accurately predict drug response before experimental screening could be of great help in reducing the search space and experimental effort required. However, few methods currently exist for this task.

The Dialogue for Reverse Engineering Assessments and Methods (DREAM) challenges are a community effort aimed at developing new computational approaches to address important biological and human health questions. In 2012, the community launched the DREAM7 - NCI-DREAM, Drug Sensitivity and Drug Synergy Challenges. One part of this challenge focused on using the pharmacological data, gene expression profiles and copy number information of a human B cell lymphoma cell line treated with several drug combinations to computationally rank the compound pairs from most synergistic to most antagonistic. This study revealed that, although the prediction accuracy of the majority of the submitted models was not ideal, computational prediction of compound synergy is possible (Bansal et al. 2014), and several of the best performing methods have been subsequently published (Yang et al. 2015; Goswami et al. 2015; Zhao et al. 2014).

In September 2015, the DREAM community launched the AstraZeneca-Sanger Drug Combination Prediction challenge to gather more insight on the factors underlying drug synergy and to further the development of drug synergy prediction methods. Participants are supplied with both pharmacological and genetic/genomic data from a panel of 85 cell lines treated with a large number of drug combinations, and are expected to build models with and without using synergy training data, to simulate different clinical scenarios.

1.2 Objectives

Given the context presented above, the main aim of this work is the development and evaluation of machine learning based methods for the prediction of the effects of drug combinations on cancer cell lines, taking into account diverse features of the cell lines and of the drugs used.

More specifically, the work will address the following scientific/technological objectives:

• Reviewing the relevant literature for machine learning methods and their applications in related scenarios;

• Studying the available data sources in detail, both for the cell lines and for the drugs used;

• Developing machine learning pipelines using the available data and evaluating the different alternatives based on the defined criteria;

• Evaluating efficient feature selection methods to select the most appropriate set of features for enhanced predictions;

• Generating and submitting predictions for unknown data.

1.3 Structure of the thesis

Chapter 2 Drug Combination Prediction

The machine learning concepts and algorithms mentioned throughout this dissertation are introduced. The use of combinatorial therapies in cancer treatment is explained, and the experimental determination of drug interaction effects is described. Some of the existing computational methods for predicting combination effects are described.

Chapter 3 Materials and Methods

The AstraZeneca-Sanger Drug Combination Prediction DREAM challenge and the machine learning pipeline developed during this work are described in detail.

Chapter 4 Results

Validation and official challenge results are presented in this chapter.

Chapter 5 Discussion

The models developed in this work are compared to those of the best performing teams in this challenge and to those from previous DREAM challenges. Feature and dataset importance is also discussed.

Chapter 6 Conclusion

A brief analysis of this work and future perspectives are presented in this chapter.

Chapter 2 Drug Combination Prediction

2.1 Supervised Machine Learning

The field of machine learning is focused on the design and application of algorithms that can learn information directly from data, and make accurate predictions using a model that is inferred from input data alone (Mohri et al. 2012).

Supervised learning is one of the most common machine learning scenarios. In supervised learning tasks, an estimator is first trained on a training dataset that contains a set of input variables (features) and the corresponding response (output) values. The goal is to obtain a function that maps the input variables to the output variable, while minimizing a particular loss function (Friedman 2002). Once constructed, the model can be applied to previously unseen data to predict the corresponding output values.

Supervised learning can be used for classification tasks, if the response value is categorical, ranking tasks, and regression problems, when the output variable is continuous. Given that the output variable of the training sets used throughout this work is continuous, the DREAM challenge questions were approached as regression problems. There are various regression algorithms, such as linear regression methods, support vector machine regression (SVR), and tree-based methods, among others.

Linear regression methods are the simplest regression algorithms. They assume that the relationship between the input features and the output variable is linear, and attempt to find the linear function that best fits the data. In ordinary least squares linear regression, the model parameters are estimated by minimizing the sum of squared errors.

If the determined model is excessively complex, overfitting may occur, causing the model to lose generalization capacity. To avoid overfitting, a regularization term that penalizes the size of the regression coefficients can be added. Ridge regression is a linear least squares regression method with regularization which uses the L2 norm as a penalty.
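The effect of the L2 penalty can be illustrated with a minimal single-feature sketch. The function name `fit_line`, the penalty strength `alpha`, and the toy data are illustrative assumptions and are not part of the pipeline developed in this work. With `alpha=0` the function reduces to ordinary least squares; a positive `alpha` shrinks the slope towards zero.

```python
def fit_line(xs, ys, alpha=0.0):
    """Fit y = w*x + b by least squares; alpha > 0 adds an L2 penalty on w."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Centering the data keeps the intercept out of the penalty term.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / (sxx + alpha)   # alpha = 0 gives the ordinary least squares slope
    b = y_mean - w * x_mean
    return w, b


# Toy data (hypothetical), roughly following y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

w_ols, b_ols = fit_line(xs, ys)                  # unpenalized fit
w_ridge, b_ridge = fit_line(xs, ys, alpha=5.0)   # penalized fit, smaller slope
```

In this one-dimensional case the ridge slope is simply the least squares slope with `alpha` added to the denominator, which makes the shrinkage explicit.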

Support Vector Regression (Drucker et al. 1996) is a kernel-based machine learning method. To be able to model non-linear problems, SVR algorithms start by mapping the input features to a new feature space, using a kernel function, in which the relationship with the output can be modeled linearly. The SVR method is based on the optimization of the ε-insensitive loss function (although other loss functions may be used). The main objective is to determine a function where the observed output values have at most ε deviation from the predicted values (Smola & Schölkopf 2004). Any errors that are within this threshold are ignored, while deviations greater than this are increasingly penalized according to the distance of the observed value to the predicted function. As a result, the SVR model is only dependent upon a subset of the training data.
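The ε-insensitive loss described above can be written in a few lines. This is a minimal sketch of the linear form of the loss only; the function name and the default value of `epsilon` are illustrative assumptions.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Return 0 inside the epsilon tube, and the excess deviation outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# A deviation of about 0.05 falls inside the tube and is ignored.
inside = epsilon_insensitive_loss(1.0, 1.05)
# A deviation of 0.5 exceeds the tube by roughly 0.4 and is penalized.
outside = epsilon_insensitive_loss(1.0, 1.5)
```

Training samples whose loss is zero do not influence the fitted function, which is why the final model depends only on a subset of the training data.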

To improve model performance, multiple estimators can be combined in ensembles. A simple ensemble is one where predictions from several different estimators are averaged, giving the final prediction values (this is referred to as a “wisdom of crowds” approach in the remainder of this text). Stacking is another ensemble strategy, where predictions from various models are used as input for another estimator that makes the final prediction (Wolpert 1992).
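The averaging ("wisdom of crowds") strategy can be sketched as follows. The prediction values and variable names are made up for illustration and do not correspond to results from this work.

```python
def average_predictions(prediction_sets):
    """Average the predictions of several estimators, sample by sample."""
    n_models = len(prediction_sets)
    return [sum(values) / n_models for values in zip(*prediction_sets)]


# Hypothetical predictions from three estimators for the same three samples.
gb_pred = [10.0, -5.0, 20.0]
rf_pred = [14.0, -1.0, 23.0]
pls_pred = [6.0, 0.0, 17.0]

ensemble_pred = average_predictions([gb_pred, rf_pred, pls_pred])
# ensemble_pred is [10.0, -2.0, 20.0]
```

Stacking differs in that the per-model predictions would be passed as input features to a further estimator instead of being averaged with equal weights.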

Random forests (RF) (Breiman 2001) and gradient boosting (GB) (Friedman 2001) are more elaborate ensembles. Both typically use decision trees as their base estimator. Decision trees build trees of decision rules inferred from the input data. Each internal node (decision node) of the tree represents a decision that is made on a particular feature, while branches represent the possible values for a given feature and leaf nodes are the final predicted values. There are several tree building algorithms, one of which is the classification and regression trees (CART) algorithm (Breiman et al. 1984). CART regression trees minimize the residual sum of squares when deciding on features to split decision nodes. To reduce overfitting and improve model performance, the CART algorithm includes a tree pruning step which is guided by cross-validation.
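The split criterion of a regression tree can be illustrated for a single feature. The sketch below is a simplified stand-in for one step of tree growing, not the full CART algorithm (there is no recursion or pruning); the function names and toy data are assumptions.

```python
def rss(values):
    """Residual sum of squares of values around their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_rss) for the best binary split on one feature."""
    pairs = sorted(zip(xs, ys))
    best = (None, rss(ys))  # start from the unsplit node
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        total = rss(left) + rss(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Toy data with two clearly separated groups of responses.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]
threshold, total_rss = best_split(xs, ys)  # threshold lies between 3.0 and 10.0
```

A full tree would repeat this search over all candidate features at every node and recurse into the two resulting subsets.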

Random regression forests apply bootstrap aggregating (bagging) to create different models. Bagging generates a number of distinct training sets by sampling from the training set with replacement (bootstrapping). Each tree is grown using a modified version of the CART algorithm, which tests a random selection of features at each node instead of all the remaining features. The random forest predictions are an average of the predictions from the individual regression trees. The forest growth process is guided by the out-of-bag error, an estimate of the generalization error on each training sample, when predicted using the trees where bagging did not include the sample in their training set. This internal estimate is also used to determine feature importance.
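The two sources of randomness in a random forest, bootstrapped training sets and random feature subsets, can be sketched with the standard library alone. The sample and feature counts below are arbitrary assumptions chosen for illustration.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw a bootstrap sample and report which samples were left out of it."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# One tree's training set: 10 draws with replacement from 10 samples.
in_bag, out_of_bag = bootstrap_indices(10, rng)

# Candidate features tested at one node: 4 of 20, drawn without replacement.
candidate_features = rng.sample(range(20), 4)
```

The out-of-bag samples of each tree act as a built-in validation set: every training sample is predicted only by the trees that never saw it.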

Gradient tree boosting also usually uses CART trees as a base estimator. In GB, the gradient descent optimization method is used to minimize an arbitrary differentiable loss function, which, for regression, is usually least squares. The base estimators are trained in a sequential manner (boosting), guided by the negative gradients of the loss function being minimized. This strategy is used to iteratively transform weak learners into more predictive estimators. For each iteration, the base estimator is fit with a subset of samples that is randomly chosen without replacement from the full training dataset (Friedman 2002).
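The sequential fitting procedure can be sketched for the least squares loss, where the negative gradient is simply the residual. This is a simplified illustration under several assumptions: one input feature, depth-one trees (stumps) as base estimators, and no random subsampling of the training set. The function names, the number of rounds, and the learning rate are hypothetical.

```python
def fit_stump(xs, residuals):
    """Least squares stump on one feature: (threshold, left_value, right_value)."""
    best = None
    values = sorted(set(xs))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Fit stumps sequentially to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)  # initial constant prediction
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data with a step between x = 3 and x = 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

model = gradient_boost(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict(model, x) for x in xs])
```

The small learning rate means that each stump corrects only a fraction of the remaining error, so many weak learners are combined into the final estimator.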

Partial least squares (PLS) (Wold 1985) regression is a statistical method that models the relationship between two data matrices based on a linear multivariate model. One of the matrices (X) contains predictor variables, and the other (Y) consists of one or multiple output variables. The PLS algorithm uses the features in the X matrix to form a small number of new variables, which are estimated as linear combinations of the original variables. These new variables are then used to predict the outputs. Before model fitting, the input data is typically log transformed, scaled, and centered (Wold et al. 2001).

When working with datasets with numerous features, as is the case for most bioinformatics problems, feature selection is a crucial step. Feature selection reduces model complexity, helping to avoid overfitting and, consequently, improving model generalization. It also makes the learning process faster, and it simplifies the interpretation of both the model and the data (Saeys et al. 2007). There are three main categories of feature selection methods: filters, wrappers, and embedded feature selection.

Filters only consider the properties of the dataset when evaluating feature importance, completely ignoring the interaction with the estimator. A feature relevance score is calculated for all the variables. This score determines which subset of features is retained for model fitting, which only occurs after the selection step. Filter selection methods are fast and scalable. Most filters are univariate, which means that they ignore any dependencies between features. Multivariate filters take these dependencies into account, but they are slower and less scalable than univariate methods.
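A univariate filter can be sketched by scoring each feature independently and keeping the highest-ranked ones. The absolute Pearson correlation with the output is used here as the relevance score purely as an example; the feature names and values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (sx * sy)


def select_top_k(features, target, k):
    """Rank features by |correlation| with the target and keep the best k."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


features = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],  # increases with the target
    "gene_b": [2.0, 1.0, 2.0, 1.0, 2.0],  # mostly unrelated to the target
    "gene_c": [5.0, 4.1, 2.9, 2.0, 1.2],  # decreases as the target increases
}
target = [1.1, 1.9, 3.2, 3.9, 5.1]

selected = select_top_k(features, target, k=2)  # keeps gene_a and gene_c
```

Because each feature is scored on its own, two redundant features would both be retained, which is the dependency blindness of univariate filters noted above.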

Wrappers include a model fitting step within the feature selection process. Subsets of features are iteratively tested by training the estimator with the reduced dataset and evaluating model performance. Wrapper methods take into account interaction with the estimator, as well as feature dependencies, but they are typically more computationally intensive than filters.

In embedded feature selection, the search for an optimal subset of features occurs during model construction. In this case, model interaction and feature dependencies are not ignored, but the feature selection process is computationally less intensive than in wrapper methods.

To assess the quality of a given predictive model, it must be tested on a set of examples that were not used to train the model and for which the output values are known. This validation set can be obtained by simply dividing the input data into two sets, and holding out one of the sets when training. Other ways to generate training and validation sets exist, such as cross-validation or bootstrapping.

K-fold cross-validation randomly partitions the input dataset into k subsets (folds) that are equal in size and mutually exclusive. The estimator is evaluated k times, each time using one of the folds as the test set and the remaining folds as the training set. The cross-validated model performance is the average of the performance score calculated for the k iterations (Kohavi & Provost 1998). Leave-one-out cross-validation is the case where k is equal to the number of samples in the training set.
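The partitioning step can be sketched with the standard library. The function names and the `score_fold` callback are illustrative assumptions; in practice the callback would train an estimator on the training indices and score it on the test indices.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k mutually exclusive folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partition of the samples
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, k, score_fold):
    """Average a user-supplied score over the k train/test splits."""
    scores = [score_fold(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(10, 5))

# Placeholder scorer that only reports the test fold size, to show the call shape.
mean_test_size = cross_validate(10, 5, lambda train, test: len(test))
```

Setting `k` equal to `n_samples` yields leave-one-out cross-validation, with a single sample in each test fold.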

Bootstrapping is another resampling method that can be used to evaluate model performance. It creates a validation set by sampling from the original dataset with replacement.

While model parameters are optimized during model fitting, estimator hyperparameters must be set before the model fitting step and are therefore optimized separately. Hyperparameter optimization can be achieved by performing an exhaustive search (grid search) across all (or a set of) possible combinations of user-specified hyperparameter values. Cross-validated performance metrics are used to evaluate each proposed set of hyperparameters. The chosen set of hyperparameter values will be the one that maximizes prediction performance. Random search is an alternative that only evaluates a number of hyperparameter combinations sampled from the entire search space. In high-dimensional search spaces it is more efficient than grid search (Bergstra & Bengio 2012).
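The difference between the two search strategies can be sketched as follows. The hyperparameter names, the candidate values, and `toy_score` (a stand-in for a cross-validated performance metric) are all assumptions made for illustration.

```python
import itertools
import random


def grid_search(grid, evaluate):
    """Evaluate every combination of the listed values and return the best one."""
    names = list(grid)
    candidates = [dict(zip(names, combo))
                  for combo in itertools.product(*(grid[n] for n in names))]
    return max(candidates, key=evaluate)


def random_search(grid, evaluate, n_iter, seed=0):
    """Evaluate only n_iter randomly sampled combinations."""
    rng = random.Random(seed)
    candidates = [{name: rng.choice(values) for name, values in grid.items()}
                  for _ in range(n_iter)]
    return max(candidates, key=evaluate)


# Hypothetical search space with 3 x 3 x 3 = 27 combinations.
grid = {
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [2, 3, 4],
}


def toy_score(params):
    # Stand-in for a cross-validated score; it peaks at 300 / 0.05 / 3.
    return -(abs(params["n_estimators"] - 300) / 100
             + abs(params["learning_rate"] - 0.05) * 10
             + abs(params["max_depth"] - 3))


best_grid = grid_search(grid, toy_score)                  # 27 evaluations
best_random = random_search(grid, toy_score, n_iter=10)   # 10 evaluations
```

Grid search is guaranteed to find the best listed combination, whereas random search trades that guarantee for a fixed evaluation budget that does not grow with the number of hyperparameters.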

2.2 Cancer therapeutics

2.2.1 Therapeutic Resistance in Cancer

Drug-based therapies are commonly used in cancer treatment. These include traditional chemotherapy, which uses cytotoxic compounds that kill all rapidly dividing cells, and molecularly targeted drugs. The efficacy of these single-agent therapies is often reduced due to the existence of tumor drug resistance mechanisms (Holohan et al. 2013).

Drug resistance may be intrinsic or acquired. Intrinsic resistance is caused by the presence of resistance-conferring characteristics in the tumor cells prior to therapy, while acquired resistance develops during treatment as an adaptive response of the tumor. Furthermore, owing to the heterogeneity of tumors, drug-resistant cells may be positively selected for during treatment (Holohan et al. 2013). The existence of drug-resistant cancer stem cells within the tumor is also problematic. These tumorigenic cells may remain even after the more differentiated, non-tumorigenic tumor cells have been killed by conventional treatments, which can lead to relapse in the future (Housman et al. 2014).

A variety of mechanisms have been implicated in cancer drug resistance (Figure 2.1) (Holohan et al. 2013; Housman et al. 2014). There may be alternative, functionally redundant pathways that are activated once the targeted pathway is inhibited by the drug. Tumors undergoing treatment may adapt by activating survival signaling pathways or by suppressing death signaling pathways.

DNA repair mechanisms can also contribute to drug resistance, as they maintain cell survival by reversing the damage caused by compounds meant to have a disruptive effect on DNA.

Drug target alterations, such as mutations and expression level changes (usually overexpression), also affect drug response, reducing sensitivity to treatment and ultimately resulting in drug resistance.

Many drugs used in cancer treatments need metabolic activation through complex mechanisms that involve interaction with several different proteins. In drug-resistant cells there may be a decrease or even lack of drug activation or the compound may be inactivated before it can affect its target (Housman et al. 2014).

Drug transport across the cell membrane may be altered, also impacting drug response. There may be a reduced uptake of the drug into tumor cells, or an overexpression of cell membrane proteins related to multidrug resistance that leads to an increase in the rate of drug removal (drug efflux) from the cell.

Given that epigenetic modifications, like DNA methylation and histone modification, play an important role in the regulation of gene expression, they can also contribute to drug resistance, by altering the expression levels of drug targets, or proteins necessary for drug activation or DNA repair, among other effects (Housman et al. 2014).

Epithelial cells may experience morphological changes, transitioning to a more invasive phenotype, in a process called the epithelial to mesenchymal transition. This process appears to be linked to increased drug resistance (Housman et al. 2014).

The tumor microenvironment can also play an important role in drug resistance by protecting malignant cells from drugs, thus preventing cell death and providing cells with a greater opportunity to acquire resistance mechanisms (Holohan et al. 2013).

Figure 2.1 – Mechanisms of drug resistance (Holohan et al. 2013; Housman et al. 2014).

A common strategy to overcome drug resistance is the administration of two or more drugs in combination. Combinatorial therapies may circumvent pre-existing resistance mechanisms more easily, as well as prevent the development of acquired resistance mechanisms.

2.2.2 Combination Therapy & Drug Synergy

When multiple drugs are jointly administered, the combination effect can be classified based on the difference between the observed response and the response that would be expected assuming that the drugs do not interact with each other (Tang et al. 2015). If the resulting combination effect is the expected effect, it is termed additive. A drug combination is considered synergistic if the drug response is enhanced beyond expected, whereas if the response is reduced when compared to the anticipated response, the combined effect is antagonistic. Therefore, ideal drug combinations are those that produce synergistic effects. Drug synergy increases treatment efficacy without requiring an increase in drug dosage, potentially avoiding an increase in toxicity (Tallarida 2011).

Several reference models have been proposed to quantify drug combination effects. The definition of expected combination response differs between reference models, and, therefore, the identification of synergy/antagonism also varies between models. One of the most commonly used reference models is the Loewe additivity model (Loewe & Muischnek 1926). Loewe additivity postulates that a compound cannot have a synergistic interaction with itself, and that both compounds have similar mechanisms of action (Fitzgerald et al. 2006). Furthermore, the model requires information from the dose-response curves of the individual compounds to estimate the additive effect. The combination effect can then be determined through isobologram analysis (Tallarida 2006).
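For reference, the Loewe additivity condition is commonly written as shown below. The notation is an assumption introduced here for illustration: d_1 and d_2 are the doses of the two compounds that jointly produce an effect E, and D_1(E) and D_2(E) are the doses of each compound that produce the same effect when administered alone, as read from the single-agent dose-response curves.

```latex
\[
\frac{d_1}{D_1(E)} + \frac{d_2}{D_2(E)} = 1
\]
```

Under this formulation, a left-hand side below 1 indicates synergy (less of each drug is needed than additivity predicts), and a value above 1 indicates antagonism.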

Potential drug combinations are typically experimentally screened in high-throughput cell viability assays (Borisy et al. 2003). Compounds are administered simultaneously at varying concentrations, and the resulting combination effect is described by a dose-response surface, from which the single-agent dose-response curves can be derived.

2.3 Computational Methods in Drug Discovery

For both practical and financial reasons, experimental screening of all conceivable drug combinations to find the most effective compound pairs is currently unachievable (Bansal et al. 2014). To reduce the search space and experimental effort required, computational methods that predict drug response could be employed before experimental screening (Sun et al. 2013). A few methods have been described in recent years. Some of the proposed models apply systems biology approaches, such as protein-protein interaction networks and pathway analysis, to study drug responses. Others predict compound synergy based on drug properties, or using “omics” data, such as genomic data, and some of these models use machine learning in their approach (Chen et al. 2015). Nevertheless, there are currently no established methodologies for the computational prediction of drug combination effects.

2.3.1 Previous DREAM Challenges

Before the AstraZeneca-Sanger challenge, the DREAM community had already launched two similar challenges in 2012, the DREAM7 - NCI-DREAM, Drug Sensitivity and Drug Synergy Challenges. The NCI-DREAM, Drug Sensitivity challenge (Costello et al. 2014) was aimed at building models to predict and rank the sensitivity of 18 breast cancer cell lines to 31 individual compounds. Training data for this challenge consisted of drug response data for 35 cell lines not included in the test set, and gene expression, mutation, copy number variation (CNV), methylation, and protein quantification data for the 53 cell lines, although some of these data types were not available for all of the cell lines.

A total of 44 predictive models were submitted for the Drug Sensitivity challenge. Analysis of the results revealed that the models that performed best were nonlinear models, such as kernel methods and regression trees. Gene expression microarray data was considered the most predictive of the datasets, with performance increasing when additional datasets were also included. In addition to utilizing the datasets provided by the challenge, the best models also integrated prior knowledge in the form of biological pathway information. The study also found that predictive features could be found for most of the compounds.

The goal of the NCI-DREAM Drug Synergy challenge (Bansal et al. 2014) was to predict the effect of 91 drug combinations of 14 compounds on a diffuse large B-cell lymphoma cell line, OCI-LY3, ranking each of the compound pairs from the most synergistic to the most antagonistic. The challenge supplied drug response data at 24h for each individual compound, derived from the dose-response curves. The effect of each combination was evaluated in the same experiment from which the monotherapy responses were derived. However, participants did not have access to information on the combination effects for any of the drug pairs. This absence of direct training data prevented the use of traditional machine learning approaches. Other than the pharmacological data, participants were given access to gene expression profiles determined before and after treatment with the individual compounds at two different concentrations and three time points after the beginning of treatment. Baseline gene expression for the cell line in normal growth media, profiled at the same three time points, was supplied as well. Additionally, a baseline single nucleotide polymorphism (SNP) profile was provided for the OCI-LY3 cell line.

The Drug Synergy challenge had a total of 31 submissions which adopted a wide variety of strategies. Some submissions assumed that compounds that resulted in similar gene expression profiles when administered individually were more likely to be synergistic, while other submissions assumed that compounds that led to different expression profiles were more likely to be synergistic. Still others based their predictions on a combination of the previously mentioned hypotheses, or used more complex hypotheses to predict synergistic combinations. The submissions also differed in the datasets that were used as input.

The prediction accuracy of most of the submitted models was limited, and only four models produced statistically significant results. The best models took advantage of the various parameters that can be derived from the dose-response curves, and assumed that the compounds might act in a sequential manner, despite being administered simultaneously. None of the models was effective in identifying both synergistic and antagonistic combinations, that is, models that performed well for synergistic cases did not perform well on antagonistic combinations, and vice versa. The merging of all of the methods in a “wisdom of crowds” approach resulted in a better performance than any of the isolated models. These results revealed that computational prediction of compound synergy is possible, but requires considerable improvement in methodology.
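As an illustration of how a “wisdom of crowds” consensus can be formed, the following minimal sketch averages the ranks that several models assign to the same drug pairs. Mean-rank aggregation, the drug pair labels, and the scores are assumptions made for illustration only; the aggregation scheme actually used in the challenge analysis is not described here.

```python
import pandas as pd

def consensus_ranking(scores: pd.DataFrame) -> pd.Series:
    """Order drug pairs by their mean rank across models.

    scores: rows are drug pairs, columns are models; a higher score means
    the model predicts the pair to be more synergistic.
    """
    # Rank within each model (1 = most synergistic), then average across models.
    ranks = scores.rank(axis=0, ascending=False)
    return ranks.mean(axis=1).sort_values()

# Hypothetical predictions from three models for four drug pairs.
scores = pd.DataFrame(
    {"model_a": [0.9, 0.1, 0.5, 0.3],
     "model_b": [0.8, 0.2, 0.1, 0.6],
     "model_c": [0.7, 0.4, 0.6, 0.1]},
    index=["A+B", "A+C", "B+C", "B+D"],
)
consensus = consensus_ranking(scores)
```

Averaging ranks rather than raw scores keeps models with different score scales comparable, which is one common reason to prefer it for this kind of aggregation.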

The challenge revealed that synergistic and antagonistic effects are highly dependent upon the genomic context of the cell lines, and not solely determined by the targets or the chemical and structural properties of the drugs. This highlights the importance of including genomic data when building drug combination models. Nevertheless, a clear relationship between the datasets that were used and model performance was not found when analyzing the submission results. The gene expression profiles measured 24 hours after drug administration seemed to have a slight effect on model performance, but this effect was not statistically significant.

Several of the best drug combination prediction methods developed during the NCI-DREAM Drug Synergy challenge were subsequently published.

The Drug-Induced Genomic Residual Effect (DIGRE) model (Yang et al. 2015) was the best performing model in the challenge. It models synergy in a sequential manner, despite the fact that the drugs were administered to cells simultaneously. It assumes that when two compounds are sequentially administered, the transcriptional changes caused by the first compound (transcriptomic residual effects) have an impact on the effect of the second drug.

The DIGRE algorithm estimates the changes in gene expression prompted by treatment with the individual compounds, and calculates an expression similarity score between both drugs. The residual effect of the first drug on the second drug is then determined based on the expression similarity, and the drug combination effect is calculated based on this residual effect. To improve model performance, the model considers a “focused view”, which only includes differentially expressed genes that belong to specific cell growth pathways, and a “global view”, which takes into consideration genes that are upstream of the differentially expressed genes in cancer-relevant pathways.

The second best method (Goswami et al. 2015) assumes that the effect of an individual drug is associated with the expression change of a particular set of genes, the core genes. The algorithm starts by determining which genes are differentially expressed after treatment relative to baseline expression. The set of core genes is subsequently created by selecting all of the genes that are significantly differentially expressed in at least one of the individual drug treatments. The authors noted that the model performed better using this core set when compared to using only target genes. The gene expression profiles after treatment, limited to the set of core genes, are then compared. A score is calculated for each drug combination based on the similarity of the profiles, and the combinations are ranked. A drug interaction score is likewise calculated based on the dose-response curves for each of the compounds, and the combinations are ranked according to this score as well. The final ranking is obtained by comparison of the two rankings.

Another method (Zhao et al. 2014), which ranked fourth in the challenge, exclusively used the baseline expression profile and the expression profiles obtained at different time points after treatment with each individual compound to rank the drug combinations. The model is based on the hypothesis that the mechanism of action of an individual compound is reflected in the set of genes that are differentially expressed after treatment with that compound. The authors found that these differential gene signatures are more informative than randomly selected genes, and that the frequently selected gene signatures were enriched in genes associated with DNA metabolism, cell cycle processes, and the p53 signaling pathway.

The method also assumes that the correlation between the differentially expressed gene sets produced by two drugs expresses the combination effect. According to this hypothesis, compounds with similar differential gene expression profiles after drug administration may be synergistic, whereas two drugs with different sets of differentially expressed genes may have an antagonistic effect. Nevertheless, the authors admit that the model, based exclusively on gene expression profiles of individual drugs, did not perform well when the effects of the individual drugs were different. They also considered that incorporating more types of cell-response information is necessary to improve prediction accuracy.
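The similarity hypothesis can be sketched as follows, assuming that each drug is summarized by a vector of log fold changes over a shared gene set and that Pearson correlation is the similarity measure. Both choices, the function name, and the toy values are illustrative assumptions and not the published implementation.

```python
from itertools import combinations

import numpy as np

def rank_pairs_by_similarity(signatures: dict) -> list:
    """Rank drug pairs from most to least similar expression signature."""
    scored = []
    for a, b in combinations(sorted(signatures), 2):
        # Pearson correlation between the two differential expression profiles.
        r = np.corrcoef(signatures[a], signatures[b])[0, 1]
        scored.append(((a, b), float(r)))
    # Under the stated hypothesis, a higher correlation suggests synergy.
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical log fold changes for five genes after single-drug treatment.
signatures = {
    "drug_1": np.array([1.0, 2.0, -1.0, 0.5, 0.0]),
    "drug_2": np.array([0.9, 1.8, -1.2, 0.4, 0.1]),
    "drug_3": np.array([-1.0, -2.0, 1.0, -0.5, 0.0]),
}
ranking = rank_pairs_by_similarity(signatures)
```

In this toy example the pair with nearly identical signatures is ranked first, and the pair with exactly opposite signatures is ranked last.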

The experimental dataset released with the NCI-DREAM Drug Synergy challenge was originally a validation dataset for the SynGen (Bansal et al. 2014) algorithm. Similar to the submissions for the NCI-DREAM challenge, this method does not employ machine learning, but unlike the challenge models, it is exclusively meant for the prediction of synergistic combinations. The algorithm is based on the idea that the activity of “master regulators”, which are required for the preservation of a phenotype-specific gene expression signature, is crucial for the maintenance of cell viability. When the activity of cell state master regulators is suppressed or master regulators of cell death are triggered, there may be a loss of viability. Therefore, the first step of the SynGen method is deducing the master regulator activity patterns for both cell state and cell death of the OCI-LY3 cell line. The model then identifies synergistic drug combinations as those where both compounds affect the activity of these master regulators in a similar manner.

2.3.2 Other Related Work

In 2011, Zhao et al. (Zhao et al. 2011) proposed a model to identify effective drug pairs by comparing newly proposed combinations with already approved drug combinations. In the model, drug combinations are characterized by pairs of chemical and pharmacological features of the individual drugs, such as target and pathway information, side effects, medical indication areas, and Anatomical Therapeutic Chemical codes. Each feature pair of the candidate drug combination is compared to the feature pairs of known combinations. A score is calculated to measure the similarity between candidate pairs and known combinations, based on the feature pairs that are enriched in effective drug combinations. For this model, the more similar a drug pair is to known combinations, the more likely it will be an effective combination.

The Probability Ensemble Approach (Li et al. 2015) also predicts effective drug combinations by comparing novel compound pairs to known drug combinations, based on molecular and pharmacological features of the compounds and their targets. Besides predicting the probability of a drug combination being effective, the method also predicts the probability of it being an undesirable drug-drug interaction. The similarity of a candidate pair to a known combination is estimated using six similarity measures, covering chemical similarity, side effect profiles, Anatomical Therapeutic Chemical codes, target sequence similarity, Gene Ontology semantic similarity of the targets, and the distance of the targets on a protein-protein interaction network. A Bayesian network is employed to determine the probabilistic similarity of the candidate combination to the known interactions based on these similarity measures.

Another proposed machine learning model (Lozano 2013) represents drug combinations as vertices on a graph, and then ranks candidate drug combinations by their effectiveness. The model is trained using a small list of known drug combinations with experimentally determined synergy scores. A polynomial kernel is used to represent the similarity between drug combinations.

The Target Inhibition Interaction using Maximization and Minimization Averaging (TIMMA) R package (He et al. 2015) is a logic-based network algorithm for the prediction of synergy scores. The model assumes that the effects of a drug combination can be inferred from all of the drug-target interactions of its compounds, which are used as input. Cell line or patient-derived sensitivity profiles for each of the individual drugs are also used as inputs. In addition to predicting drug synergy, the model produces a visualization of the target inhibition network.

Jin et al. (Jin et al. 2011) created an enhanced Petri Net model, a graph-based model in which nodes represent places and transitions, to predict synergistic effects of drug combinations. The proposed method requires gene expression data obtained after the administration of a given drug combination, as well as the expression profiles following treatment with each drug on its own. Using these expression profiles, the model simulates the impact of the combination and single-drug treatments on the signaling network downstream of the therapeutic targets. A synergistic interaction between the drugs is identified based on the comparison between the drug combination effects and the effects of the individual compounds.

DrugComboRanker (Huang et al. 2014) reconstructs a drug functional network from publicly available gene expression profiles of several cell lines before and after treatment with a variety of compounds, with the purpose of identifying “communities” of drugs that induce similar responses to treatment, thus having similar mechanisms of action. The drug functional network is subsequently used to determine the functional targets of drugs based on their expression profiles. A disease-specific signaling network is also reconstructed from gene expression and protein interactome data. Drug combinations are then ranked by considering three distinct scores: the relatedness of the targets in the signaling network, the dissimilarity of the gene expression profiles of different drugs, and the semantic similarity of the gene ontologies of the targets.

Chapter 3 Materials and Methods

3.1 Challenge Description

Recently, the DREAM community launched the AstraZeneca-Sanger Drug Combination Prediction DREAM challenge, in collaboration with AstraZeneca, the European Bioinformatics Institute, the Sanger Institute, and Sage Bionetworks. It ran from September 3rd, 2015 to March 21st, 2016. The main goals of this crowdsourcing initiative were to expand knowledge on the mechanisms and factors responsible for drug synergy, ideally translated into a list of biomarkers relevant for drug synergy, and to further the development of synergy prediction methods. To accomplish this, AstraZeneca offered challenge participants access to unreleased data on drug combination experiments involving 118 different compounds, combined with one another and screened over 85 cancer cell lines, and baseline (before treatment) genomic data for each of the cell lines provided by the Genomics of Drug Sensitivity in Cancer (GDSC) (Yang et al. 2013) and Catalogue of Somatic Mutations in Cancer (COSMIC) (Forbes et al. 2015) projects at the Sanger Institute.

The challenge was subdivided into three subchallenges (Figure 3.1). All of them focused on creating models for synergy prediction, but each one simulated a different clinical scenario.

Subchallenges 1A (Figure 3.2) and 1B (Figure 3.3) are classic machine learning problems, more specifically regression problems, since in this case drug synergy is expressed as a continuous variable. The main goal of subchallenges 1A and 1B is to develop predictive drug synergy models by training on known data. Participants had to predict the synergy values of 167 drug combinations screened across 85 cell lines. The data for subchallenge 1 was divided into 3 sets: a training set, comprising half of the original dataset, a leaderboard set (1/6 of the dataset), and a validation set for the final scoring round, which corresponds to the remaining 1/3 of the dataset. In each leaderboard round, different subsets of the original leaderboard set were used to score submissions.

Figure 3.2 – General flowchart for subchallenge 1A. The chart also gives a detailed list of the datasets that can be used to train predictive models. Adapted from (Yu & Menden 2015).

Although subchallenges 1A and 1B are similar, they differ in terms of data use restrictions. In subchallenge 1A, participants were allowed to use the data provided by the challenge in its entirety to predict synergy values, as well as being able to leverage any data from external sources.

In subchallenge 1B, direct input was limited to mutation and CNV data, and any prior knowledge that would not have required any additional experimental effort to obtain, such as putative drug targets, pathway information, and chemical descriptors. Any knowledge gained from the remaining datasets could solely be used in an indirect manner, to infer prior knowledge to aid feature selection. By restricting the input data, subchallenge 1B more closely resembles a clinical setting with limited access to multiple platforms for experimental profiling of samples. Mutations and CNVs were chosen since both datasets may be obtained from a single DNA sequencing platform, and also because they are more likely to be important cancer biomarkers.

Figure 3.3 – General flowchart for subchallenge 1B. The chart gives a detailed list of the datasets that can be used to train predictive models, explicitly stating that no experimental readout may be used other than mutational and copy number variation data. Adapted from (Yu & Menden 2015).

The goal of subchallenge 2 was to develop models that predict drug synergy without using any direct drug synergy training data, thus simulating a common clinical scenario in which treatment decisions are made based on prior knowledge alone. As in subchallenge 1A, all available data was permitted to be used as direct input, but information on the monotherapy and combination effects was not provided for the drug combinations. Participants were, however, allowed to take advantage of the known synergy values from subchallenge 1 to identify rules and for the purpose of feature selection. The subchallenge 2 dataset comprises 740 drug combinations that do not coincide with those allocated to subchallenge 1. The dataset was split into a leaderboard set with 370 combinations, and a test set with 370 combinations. The leaderboard and test sets are mutually exclusive. Figure 3.4 depicts the desired output of subchallenge 2.

Figure 3.4 – The output that is generated for subchallenge 2. Adapted from (Yu & Menden 2015).

Subchallenges 1 and 2 both had several leaderboard rounds, where teams could submit up to three different predictions for the leaderboard set for each subchallenge, and a final submission round, in which predictions for the test set could only be submitted once. The timelines for subchallenges 1 and 2 are shown in Figure 3.5.

Figure 3.5 – The official timelines for subchallenges 1A, 1B and 2 of the AstraZeneca-Sanger Drug Combination Prediction DREAM challenge. Adapted from (Yu 2015).

3.2 Datasets

The datasets supplied by the DREAM challenge and the external datasets used to supplement the challenge data are summarized in Table 3.1.

The AstraZeneca-Sanger Drug Combination Prediction challenge provided participants with data on approximately 11,500 drug combination therapy experiments involving 118 different drugs and 85 cell lines. Pharmacological data was contributed by AstraZeneca. Challenge participants were supplied with comma separated values (CSV) files containing information on each drug combination experiment, including cell line name, the compounds tested and their respective doses, and the experimentally determined combination and monotherapy effects to be used as the training set in subchallenges 1A and 1B. Similar files were provided for the leaderboard sets, although the synergy scores, which had initially been withheld from challenge participants, were only released at a later phase.

Table 3.1 – Datasets used in this work.

| Type | Data | Description | Source | Missing data? |
|---|---|---|---|---|
| Challenge data, Pharmacological | Combination & monotherapy dose-responses (Therapy) | Synergy values and single-agent response curve parameters derived from the dose-response surfaces of high-throughput screened drug combinations. | AstraZeneca | No |
| Challenge data, Pharmacological | Drug data | Dataset containing the putative targets, and chemical and structural properties of each compound. | AstraZeneca | Yes |
| Challenge data, Molecular | Mutations | Sequence variants called from whole-exome sequencing. | GDSC project (Sanger Institute) | No |
| Challenge data, Molecular | CNVs | Results of CNV surveys across all cell lines, both at the segment level and at the gene level. | GDSC project (Sanger Institute) | No |
| Challenge data, Molecular | Gene expression | RMA-normalized microarray data. | GDSC project (Sanger Institute) | Yes |
| Challenge data, Molecular | Methylation | Data from methylation arrays. Methylation status of CpG sites and CpG islands is expressed as beta-values and M-values. | Esteller group (IDIBELL) | Yes |
| Challenge data, Other information | Cell line | Characterization of each cell line. | Sanger Institute | No |
| External data, Molecular | Gene expression | RMA normalized gene expression data. | | |

In each experiment, cell viability was measured at different drug concentrations, and the ratio of the number of live cells after drug administration to the number of cells in control conditions (absence of drug) was plotted as a response surface, from which the single-agent dose-response curves were derived.

The resulting combination and monotherapy dose-responses were analyzed using Combenefit (Jodrell Group 2015), which uses the Loewe additivity model as the basis of its calculation of the synergy distribution. The synergy distribution was then integrated in logarithmic concentration space to obtain a total-synergy score, which is used to express the combination effect. In the CSV file, the single-agent effects are represented by the Hill equation parameters of the dose-response curves for both of the combination compounds.
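For reference, a single-agent dose-response curve described by Hill parameters can be evaluated as in the sketch below. The parameterization (maximal effect `einf`, half-maximal concentration `ic50`, and slope `h`, with viability expressed as a percentage) and the example values are assumptions; the column names and exact form used in the challenge CSV files may differ.

```python
import numpy as np

def hill_viability(conc, einf, ic50, h):
    """Percent viability at concentration `conc` under a Hill model.

    Viability falls from 100 (no drug) towards `einf` (saturating dose).
    Concentrations must be strictly positive.
    """
    conc = np.asarray(conc, dtype=float)
    return 100.0 + (einf - 100.0) / (1.0 + (ic50 / conc) ** h)

# Hypothetical compound: 20% residual viability, IC50 of 1 (arbitrary units).
doses = np.array([0.01, 0.1, 1.0, 10.0])
viability = hill_viability(doses, einf=20.0, ic50=1.0, h=1.0)
```

At a concentration equal to `ic50`, the curve sits halfway between 100 and `einf`, which is a quick sanity check for any fitted parameter set.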

Besides the experimental data, additional information was provided for each of the tested compounds. Putative targets were provided for all drugs. A few chemical properties (molecular weight, H-bond acceptors, H-bond donors, calculated log P, and Lipinski's rule of 5) and structural information in the form of SMILES (Simplified Molecular Input Line Entry Specification) were also given for some of the compounds. Several compounds that lack chemical and structural data had associated PubChem identifiers, which were later used to obtain that information. Additional chemical descriptors and fingerprints were calculated using the software PaDEL-descriptor (Yap 2011), version 2.21.

Challenge participants were also given access to baseline genetic and genomic data for each of the cell lines used in the experiments, as well as information on the tissue of origin. Mutational profiles generated by the GDSC project at the Sanger Institute were supplied for all 85 cell lines, summarized in a CSV file. These mutations were called from whole exome sequencing (Agilent SureSelect/Illumina HiSeq 2000) data using the software CaVEMan (Stephens et al. 2012) and PINDEL (Ye et al. 2009).

Copy number variation data from the GDSC project were provided at both the segment and gene level in CSV files. The copy number analysis was performed with Affymetrix SNP 6.0 arrays and CNVs were identified with the PICNIC algorithm (Greenman et al. 2010), using the GRCh38 human genome assembly as the reference.

Gene expression data, measured on Affymetrix Human Genome U219 array plates and Robust Multi-array Average (RMA) normalized using the affy package in R (Gautier et al. 2004), were contributed by GDSC as well. Expression data was missing for two of the cell lines (MDA-MB-175-VII and NCI-H1437). In these cases, RMA normalized gene expression data generated on Affymetrix U133 Plus 2.0 arrays by the Cancer Cell Line Encyclopedia (CCLE) project (Barretina et al. 2012) were used instead.

CpG methylation data were generated on the Illumina Infinium HumanMethylation450 v1.2 BeadChip by the Esteller group at the IDIBELL Institute. The methylation status of the CpG sites is expressed as either beta or M values in two separate files. CpG sites were also compressed into CpG islands as defined by the University of California, Santa Cruz genome browser, and the corresponding M and beta values were provided in two distinct files. Methylation data were missing for three cell lines (SW620, KMS-11 and MDA-MB-175-VII).

An additional dataset, containing information on the tissue of origin for each of the cell lines, was also made available to participants.

3.3 Pipeline Developed in this Work

Despite the differences between each of the subchallenges, similar steps are followed in all of them to create models and predict test cases. The machine learning pipeline consists of several data preprocessing steps, feature selection, parameter optimization (an optional step), model evaluation, and prediction of the test set output variable, as can be seen in Figure 3.6.

Figure 3.6 – Pipeline developed in this work. a) Flowchart of the machine learning pipeline that was used to evaluate predictive models and obtain predictions, highlighting the different modules that were used; b) Steps performed by the “load_save_data.py” script when loading and preprocessing each dataset.

The evaluateAndPredict function in the “machine_learning.py” module executes the entire pipeline, according to user-defined options that are specified in a dictionary. These options determine the datasets to be used as input, the estimator, with its respective hyperparameters, and any hyperparameters to be optimized before model fitting, the feature selection method to be applied along with the number of features to keep, and the type of subchallenge that is under consideration. A detailed description of each of the steps in this pipeline is given in the following sections.

3.3.1 Data Preprocessing

All data files are loaded and manipulated with scripts using the Python pandas (McKinney 2010) package, version 0.17.1. Functions to load and encode data in a format suitable for machine learning can be found in the load_and_save_data.py script.

Filtering based on quality assessment (QA) is performed on the pharmacological data. Each drug combination experiment has an associated quality value. A QA value of 1 indicates that the data for that experiment is reliable. Therefore, all experiments that have a QA value other than 1 are ignored.
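The QA filter amounts to a simple row selection, which could be written with pandas as in the following sketch. The column names (`CELL_LINE`, `SYNERGY_SCORE`, `QA`), the helper name, and the values are assumed for illustration and may not match the challenge files.

```python
import pandas as pd

def filter_by_quality(experiments: pd.DataFrame, qa_column: str = "QA") -> pd.DataFrame:
    """Keep only the experiments whose quality value equals 1."""
    return experiments[experiments[qa_column] == 1].reset_index(drop=True)

# Hypothetical experiments; only rows with QA == 1 are considered reliable.
experiments = pd.DataFrame({
    "CELL_LINE": ["A", "B", "C", "D"],
    "SYNERGY_SCORE": [12.3, -4.1, 30.0, 0.5],
    "QA": [1, -1, 1, -2],
})
reliable = filter_by_quality(experiments)
```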

When the chemical and structural properties of the drugs are used as features, the input dataset is split into three distinct datasets, due to the absence of these data for some compounds. The first dataset is limited to experiments in which both compounds lack these additional pharmacological data. In this case, these additional drug features are ignored. The second dataset contains all the experiments in which only one of the compounds has chemical and structural information available. These features are simply added to the input data. The third dataset includes all experiments with chemical and structural data available for both compounds. In this case, the drug data is preprocessed to create new features which summarize the information for both compounds using a single value. Numerical drug features are expressed as the absolute difference of the values for both compounds. Binary features are encoded using a three value system, in which zero indicates that both compounds originally had a value of zero for a given feature, 1 indicates that both compounds had a value of 1, and -1 indicates that the compounds had different values for a particular feature.
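The pairwise encoding used for the third dataset can be sketched as follows. The feature names, values, and helper functions are hypothetical and serve only to show the two encoding rules.

```python
def combine_numeric(a: float, b: float) -> float:
    """Numerical feature: absolute difference between the two compounds."""
    return abs(a - b)

def combine_binary(a: int, b: int) -> int:
    """Binary feature: 0 if both are 0, 1 if both are 1, -1 if they differ."""
    return a if a == b else -1

# Hypothetical descriptors for the two compounds of one combination.
drug_a = {"mol_weight": 350.4, "h_bond_donors": 2, "lipinski": 1, "fp_bit_7": 0}
drug_b = {"mol_weight": 410.9, "h_bond_donors": 5, "lipinski": 1, "fp_bit_7": 1}
binary_features = {"lipinski", "fp_bit_7"}

pair_features = {
    name: combine_binary(drug_a[name], drug_b[name]) if name in binary_features
    else combine_numeric(drug_a[name], drug_b[name])
    for name in drug_a
}
```

Both rules are symmetric in the two compounds, so the resulting features do not depend on which drug is listed first in an experiment.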

The mutation dataset is processed as follows: known SNPs, which are flagged as ‘y’ in the mutations file, are eliminated, as these mutations have also been removed from the web version of the COSMIC database. Mutations that are identified as passenger mutations by the Functional Analysis through Hidden Markov Models (FATHMM) tool (Shihab et al. 2013) are also removed from the dataset.

Additional information on the putative effects of the mutations was obtained by submitting the Human Genome Variation Society (HGVS) notations (“Mutation.CDS” column in the mutations.csv file) as input to the Variant Effect Predictor (VEP) tool (McLaren et al. 2010) from Ensembl. PolyPhen predictions from the VEP results are used to filter out point mutations identified as “benign”. The remaining data, therefore, includes mutations that were either identified as “damaging” by FATHMM or PolyPhen, or for which no effect could be predicted.

The mutation data is then converted into a matrix of cell lines versus genes. For a given cell line, the existence of only synonymous substitutions or the complete absence of mutational data for a certain gene is encoded as zero; thus, mutations that had been previously discarded because they had been identified as “passenger”/“benign” are also encoded as zero. The remaining non-synonymous substitutions, insertions and deletions are encoded as 1.
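A minimal sketch of this conversion is shown below, assuming a long-format table with one row per mutation. The column names (`cell_line`, `gene`, `effect`), the `synonymous` label, and the helper name are assumptions rather than the actual layout of the mutations file.

```python
import pandas as pd

def mutation_matrix(mutations: pd.DataFrame, cell_lines: list) -> pd.DataFrame:
    """Binary cell line x gene matrix: 1 if any retained mutation, else 0."""
    genes = sorted(mutations["gene"].unique())
    # Synonymous substitutions do not count as a mutated gene.
    kept = mutations[mutations["effect"] != "synonymous"]
    matrix = pd.crosstab(kept["cell_line"], kept["gene"]).clip(upper=1)
    # Cell lines or genes without retained mutations are filled with zeros.
    return matrix.reindex(index=cell_lines, columns=genes, fill_value=0)

# Hypothetical mutation calls that survived the earlier filtering steps.
mutations = pd.DataFrame({
    "cell_line": ["A", "A", "B", "B", "C"],
    "gene": ["TP53", "TP53", "KRAS", "TP53", "KRAS"],
    "effect": ["missense", "frameshift", "synonymous", "missense", "missense"],
})
matrix = mutation_matrix(mutations, cell_lines=["A", "B", "C", "D"])
```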

A list containing 518 known cancer genes and 1,053 candidate cancer genes downloaded from the Network of Cancer Genes (NCG) 5.0 database (An et al. 2016) is used to limit the features in the final mutation dataset to genes that are likely to be tumor drivers.

For the purpose of this work, only gene level copy number information is considered. Genes located on the Y chromosome are rejected, as the sex of a cell line’s donor impacts gene copy number. A gene is then classified as amplified, deleted or wild-type according to the GDSC/COSMIC definition, which specifies that a gene is considered amplified if there are at least 8 copies, and deletions are strictly homozygous deletions. In both cases, the entire coding sequence must be contained in one contiguous segment. The original dataset is, therefore, transformed into a cell line versus gene matrix in which amplifications are encoded as 1, deletions are encoded as -1, and the remaining CNVs are encoded as zero, similar to what is described by Menden et al. (Menden et al. 2013). The final CNV dataset is also filtered using the NCG 5.0 list of cancer genes.
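The three-value CNV encoding can be illustrated with the sketch below. It assumes a table of integer total copy numbers per gene in which a homozygous deletion appears as zero copies, and it does not check the contiguous-segment condition; the gene values and function name are hypothetical.

```python
import numpy as np
import pandas as pd

def encode_cnv(copy_number: pd.DataFrame, amp_threshold: int = 8) -> pd.DataFrame:
    """Encode gene copy numbers as 1 (amplified), -1 (deleted) or 0 (other)."""
    values = np.where(copy_number >= amp_threshold, 1,
                      np.where(copy_number == 0, -1, 0))
    return pd.DataFrame(values, index=copy_number.index,
                        columns=copy_number.columns)

# Hypothetical gene-level copy numbers for three cell lines.
copy_number = pd.DataFrame(
    {"MYC": [10, 2, 0], "CDKN2A": [2, 0, 3]},
    index=["A", "B", "C"],
)
encoded = encode_cnv(copy_number)
```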

The computational gene sets and oncogenic signatures datasets from the Molecular Signatures Database (MSigDB) (Adzhubei et al. 2010) were used to compile a list of genes that are usually differentially expressed in cancer. This list is used to reduce the dimensionality of the gene expression dataset.

The CpG islands file containing M values was selected to be used as the methylation dataset, since CpG islands usually occur in gene promoter regions and are therefore likely to be more relevant to the regulation of gene expression. M values were favored due to their reduced heteroscedasticity (Du et al. 2010) when compared to beta values.
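For context, M values are related to beta values through a base-2 logit transformation, as described by Du et al. (2010). The sketch below shows this conversion; the clipping constant `eps` is an assumption added to keep the logarithm finite at the extremes, and the challenge files already supplied both representations.

```python
import numpy as np

def beta_to_m(beta, eps=1e-6):
    """Convert methylation beta values (between 0 and 1) to M values."""
    # Clip to avoid division by zero or the logarithm of zero at the extremes.
    beta = np.clip(np.asarray(beta, dtype=float), eps, 1.0 - eps)
    return np.log2(beta / (1.0 - beta))

m_values = beta_to_m([0.2, 0.5, 0.8])
```

A beta value of 0.5 maps to an M value of zero, and the transformation stretches values near 0 and 1, which is the range where beta values are most compressed.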

Tissue of origin for each cell line is extracted from the cell line information file, creating a new dataset consisting of cell line and the respective tissue labels.

The preprocess function in the “machine_learning.py” module constructs the training and test sets and carries out additional preprocessing steps that are necessary before performing machine learning tasks using the scikit-learn (Pedregosa et al. 2012) Python package (version 0.17.0). User-selected datasets are merged into a NumPy (van der Walt et al. 2011) array with pandas. If gene expression data is included, the expression values are filtered by variance, using the VarianceThreshold class from the scikit-learn feature_selection module and a threshold of 0.2. This threshold value was chosen because it was close to the median of variances across genes (0.27). Cell line names are removed, and for subchallenge 2 compound names are removed as well. Any categorical features are encoded using a “1-of-N” encoding by applying the pandas get_dummies function.
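The variance filter and the “1-of-N” encoding can be sketched as follows with current versions of scikit-learn and pandas. The toy expression values, tissue labels, and helper name are hypothetical, and the exact behaviour of version 0.17.0 may differ in detail.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def filter_by_variance(expression: pd.DataFrame, threshold: float = 0.2) -> pd.DataFrame:
    """Keep only the genes whose variance across cell lines exceeds `threshold`."""
    selector = VarianceThreshold(threshold=threshold)
    selector.fit(expression)
    return expression.loc[:, selector.get_support()]

expression = pd.DataFrame(
    {"GENE_A": [5.0, 5.1, 4.9, 5.0],   # nearly constant, removed
     "GENE_B": [2.0, 6.0, 3.0, 7.0]},  # variable, kept
    index=["line_1", "line_2", "line_3", "line_4"],
)
filtered = filter_by_variance(expression)

tissue = pd.DataFrame({"tissue": ["breast", "lung", "breast", "colon"]},
                      index=expression.index)
# "1-of-N" encoding of the categorical tissue label.
encoded = pd.get_dummies(tissue, columns=["tissue"])
features = filtered.join(encoded)
```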

Any missing values that might be present are imputed with the median value of the respective feature using the Imputer class from scikit-learn. Features with zero variance, that is, constant value across all rows, are removed using the scikit-learn VarianceThreshold class and a threshold of