• Nenhum resultado encontrado

Assessing the impact of alternative splicing in thyroid cancer

N/A
N/A
Protected

Academic year: 2021

Share "Assessing the impact of alternative splicing in thyroid cancer"

Copied!
77
0
0

Texto

(1)

F

ACULDADE DE

E

NGENHARIA DA

U

NIVERSIDADE DO

P

ORTO

Assessing the Impact of Alternative

Splicing in Thyroid Cancer

Luís Filipe Correia Cleto

Mestrado Integrado em Engenharia Informática e Computação Supervisor: Rui Camacho (FEUP)

Co-supervisor: Valdemar Máximo (I3S/IPATIMUP & FMUP) Co-supervisor: Pedro Ferreira (I3S/IPATIMUP)

(2)
(3)

Assessing the Impact of Alternative Splicing in Thyroid

Cancer

Luís Filipe Correia Cleto

Mestrado Integrado em Engenharia Informática e Computação

Approved in oral examination by the committee:

Chair: Eugénio Oliveira

External Examiner: Sérgio Matos Supervisor: Rui Camacho

(4)
(5)

Abstract

All over the world there are millions of people affected by cancer. There are several possible causes for this disease, one of which is a genomic origin. While there are many alterations in cancer cells that distinguish them from healthy ones, there is also an abundance of aberrant mRNA transcripts in cancerous cells. These aberrant transcripts might lead to changes in the synthesized proteins which, in turn, will confer new, and possibly devastating, properties to the affected cell. One possible source of this defective mRNA is an abnormal gene splicing.

The advent of Next-Generation Sequencing (NGS) techniques has revolutionized the field of molecular biology in the past few years and brought many new approaches for cancer research. The goal of this thesis is to make use of these novel techniques, namely RNA-seq, in order to create a tool capable of detecting events of aberrant alternative splicing and assessing their impact on a cell or an organism’s transcriptome. We have extended our initial goal and expanded our analysis with extra studies that are considered valuable by expert biologists in the study of cancer. The extra developments include the identification of fusion genes.

However, current tools of this type are commonly complex to use, require powerful computa-tional resources and also need the user to be familiar with programming or other concepts from the field of informatics, which is often not the case for geneticists or molecular biologists. As such, this project also focuses on enabling these researchers to use the aforementioned tools through the creation of a distributed system which provides the computational resources necessary for an RNA-seq pipeline as well as a web interface to facilitate configuring and running experiments on this pipeline.

Lastly, the thesis proposal is evaluated on real data provided by IPATIMUP and also on data obtained from the ENCODE project.

(6)
(7)

Resumo

Por todo o mundo há milhões de pessoas afetadas pelo cancro. Existem múltiplas causas possíveis para esta doença, algumas das quais são uma origem génica. Enquanto que há muitas alterações em células cancerígenas que as distinguem das saudáveis, existe também uma abundância de registos de mARN aberrantes em células cancerígenas. Estes registos aberrantes irão levar a alterações nas proteínas sintetizadas que, por sua vez, poderão conferir às células propriedades novas, e potencialmente devastadoras. Uma possível origem deste mARN defeituoso é o splicing anormal de genes.

O advento das técnicas de sequenciação de nova geração tem revolucionado o campo da biolo-gia molecular nos últimos anos e trouxe inúmeras novas abordagens para investigação do cancro. O objetivo desta tese é recorrer a estas novas técnicas, nomeadamente à técnica RNA-seq, de modo a criar uma ferramenta capaz de detetar eventos de splicing alternativo aberrante e avaliar o seu impacto no transcriptoma da célula ou do organismo. Conseguimos ainda ultrapassar os objetivos iniciais e incluir no estudo a deteção de genes de fusão, considerados muito relevantes para o aparecimento do cancro.

No entanto, ferramentas modernas deste tipo são frequentemente complexas de utilizar, neces-sitam de recursos computacionais poderosos e também exigem conhecimentos de programação e outros conceitos da área da informática por parte do utilizador, o que frequentemente não se verifica para geneticistas ou especialistas de biologia molecular. Como tal, este projeto também se foca em permitir a estes investigadores fazer uso das ferramentas mencionadas através da criação de um sistema distribuído que fornece os recursos computacionais necessários para uma pipeline de RNA-seq e também uma interface web para facilitar a configuração e execução de experiências nesta pipeline.

Finalmente, a proposta de tese é avaliada em dados reais fornecidos pelo IPATIMUP e também em dados obtidos do projeto ENCODE.

(8)
(9)

Contents

1 Introduction 1

1.1 Problem Domain . . . 1

1.2 Motivation and Objectives . . . 2

1.3 Project . . . 3

1.4 Dissertation Structure . . . 3

2 Biological Key Concepts and Technology Survey 5 2.1 Basic Biological Concepts . . . 5

2.1.1 Glossary . . . 5

2.1.2 Genetic Processes . . . 6

2.1.3 Alternative Splicing . . . 7

2.2 RNA Sequencing and Transcriptome Assembly . . . 9

2.3 Relevant Standard File Formats . . . 12

2.4 Tools for RNA-seq . . . 15

2.4.1 RNA-Seq Pipelines . . . 17

2.5 Data Repositories . . . 19

2.6 Additional Technologies . . . 19

2.7 Chapter Conclusions . . . 21

3 The eRAP Framework 23 3.1 eRAP . . . 23 3.2 System Architecture . . . 24 3.2.1 Data Scheme . . . 25 3.3 Use Cases . . . 27 3.4 Web Interface . . . 27 3.5 Analysis Server . . . 30 3.6 Case Studies . . . 31 3.6.1 Standard Experiment . . . 31

3.6.2 Experiment With Errors . . . 32

3.7 Chapter Conclusions . . . 32

4 RNA-seq Data Analysis 33 4.1 Gene Fusion Analysis . . . 33

4.2 Alternative Splicing Analysis . . . 33

4.2.1 Junctions Comparer . . . 34

4.2.2 Usage . . . 34

4.2.3 Junction Identification . . . 35

(10)

CONTENTS

4.2.5 Case Study . . . 37

4.3 Gene Expression Quantification . . . 38

4.3.1 Output . . . 38 4.4 Chapter Conclusions . . . 39 5 Case Study 41 5.1 Samples characterization . . . 41 5.2 Research Objectives . . . 41 5.3 Analysis Protocol . . . 42

5.4 Genome version GRCh37 versus GRCh38 . . . 42

5.5 Results . . . 43

5.5.1 Gene Expression Analysis . . . 43

5.5.2 Alternative Splicing Analysis . . . 44

5.5.3 Gene Fusion Analysis . . . 44

5.6 Chapter Conclusions . . . 45

6 Conclusions and Future Work 47 6.1 Conclusions . . . 47

6.2 Future Work . . . 48

References 49 A R Analysis 53 A.1 Dependencies . . . 53

A.2 Script Code . . . 53

B Experiment Results 57 B.1 Gene Expression Results’ Descriptions . . . 57

B.2 Alternative Splicing Results . . . 58

(11)

List of Figures

2.1 Simplified overview of the gene expression transcription and splicing processes

and the structure of the genetic molecules involved . . . 7

2.2 Overview of the gene expression translation process and the structure of the ge-netic molecules involved . . . 8

2.3 The alternative splicing process. The colored blocks represent exons in the RNA sequence while the lines between them are introns . . . 9

2.4 Traditional classification of basic types of alternative splicing. In these graphics, exons are represented by boxes and introns by lines. Regions included in the messages by alternative splicing are yellow while constitutive exons are shown in blue . . . 10

2.5 Illustration of gene fusion events at a chromosomal level . . . 11

2.6 General RNA-seq workflow diagram . . . 12

2.7 iRAP pipeline architecture . . . 18

3.1 System Architecture . . . 24

3.2 System Use Cases . . . 28

3.3 Landing page of the UI module . . . 29

3.4 Experiments index of the UI module . . . 29

3.5 Experiment pages for a successful and a failed experiment . . . 30

3.6 Experiment creation page . . . 31

4.1 The graphic on the left shows the execution time for Junctions Comparer and MATS. The middle column includes the pre-processing time required to prepare the BAM samples for JC. The graphic on the right compares the results obtained by the two tools . . . 37

5.1 Most significant differentially spliced genes between wFTC and mFTC sample groups. Each graphic shows the normalized amount of alternative splicing events per sample for each gene. These results were obtained using the GRCh38 genome assembly . . . 45

(12)
(13)

List of Tables

2.1 Tools integrated in the iRAP pipeline for each standard processing stage . . . 18

3.1 MongoDB Collections . . . 25

3.2 Variables to be declared in the local_settings.py file . . . 28

4.1 Junctions Comparer Arguments . . . 35

4.2 Junction Start Classification . . . 36

4.3 Junction End Classification . . . 36

5.1 Gene Expression Results for both sample groups using GRCh37 and GRCh38. The values for each column represent the average of the normalized fraction of reads that mapped to that gene in each sample . . . 43

5.2 Gene Expression results for both sample groups using GRCh38. The values for the second and third column represent the average of the normalized fraction of reads that mapped to that gene in each sample. The last column represents the largest difference factor calculated as MAX(wFTC/mFTC, mFTC/wFTC) . . . 44

5.3 Gene Fusion events detected in the original study and by EricScript using both GRCh37 and GRCh38. The gene symbols used to identify the fusion genes are the ones used in the older annotation for compatibility with the previous paper . . 46

B.1 Short description of genes that are differentially expressed between samples mFTC and wFTC . . . 58

B.2 Alternative Splicing analysis results for both sample groups using GRCh38. The values for the third and fourth column represent the average of the normalized number of alternative splicing events that took place within that gene in each sample. The last column represents the largest difference factor calculated as MAX(wFTC/mFTC, mFTC/wFTC) . . . 59

B.3 Short description of genes that show differential alternative splicing between sam-ples mFTC and wFTC . . . 61

(14)
(15)

Abbreviations

AS Alternative Splicing A3S Alternative 3 Prime A5S Alternative 5 Prime BAM Binary Alignment/Map BED Browser Extensible Data cDNA Complementary DNA DNA Deoxyribonucleic Acid

ENCODE Encyclopedia of DNA Elements ES Exon Skipping

FTP File Transfer Protocol GFF Generic Feature Format GLPv3 General Public License 3 GTF General Transfer Format HTML Hypertext Markup Language HTTP Hypertext Transfer Protocol

I3S Instituto de Investigação e Inovação em Saúde

IPATIMUP Instituto de Patologia e Imunologia Molecular da Universidade do Porto IR Intron Retention

iRAP Integrated RNA-seq Analysis Pipeline MEE Mutually Exclusive Exons

MATS Multivariate Analysis of Transcriptome Splicing mRNA Messenger RNA

mtDNA Mitochondrial DNA

NCBI National Center for Biotechnology Information NGS Next Generation Sequencing

RNA Ribonucleic Acid RNA-seq RNA sequencing rRNA Ribosomal RNA

SAM Sequence Alignment/Map SNP Single Nucleotide Polymorphism SSL Secure Sockets Layer

TLS Transport Layer Security tRNA Transport RNA

URL Uniform Resource Locator

(16)
(17)

Chapter 1

Introduction

In this chapter we will provide an overview of the context to which this thesis belongs. We will present our objectives with this work as well as the underlying motivations. A brief analysis of the developed project will be explained as well as the structure of the remainder of this report.

1.1

Problem Domain

Molecular biology is a branch of science that studies biological activity at the molecular level. It overlaps with biology and chemistry and in particular, genetics and biochemistry. A key area of this field concerns the understanding of how DNA, RNA and the synthesis of proteins function and how this affects the various cellular systems [Man14]. One of the processes studied by molecular biology is gene expression. During this process (further elaborated in Chapter2), DNA sequences are translated into RNA that is used to synthesize proteins. Using this process DNA indirectly regulates the chemical activity of cells. Deregulation of gene expression is known to play a pivotal role in the origin of cancer [GYLL13].

One such possible deregulation is the compromise of splicing mechanisms, which allow a sin-gle gene to code different proteins. This will cause changes to the proteome (total set of proteins) of cells and possibly confer them new properties.

Next-Generation Sequencing(or Massively Parallel Sequencing) methods have been revolu-tionizing genomics and, in turn, genetics. Thanks to these methods, disciplines such as molec-ular biology are no longer restricted to small-scale genetic approaches but are able to introduce genome-wide thinking. As such, coupling these tools with the appropriate algorithms has the po-tential to radically alter our understanding of model organisms and ultimately of ourselves [Mar08]. One possible application of these instruments is in cancer research. While there have been many advances in our knowledge of cancer over the past decades, there is much that remains unknown about the mechanisms involved in its origin(s). The primary objective of this thesis is to resort to RNA-seq [WGS09], an NGS approach for transcriptome profiling, in order to create a tool capable of identifying and assessing the impact of aberrant alternative splicing events as a

(18)

Introduction

possible origin for cancer. Additionally, it is also intended to measure the impact of gene fusion events for a more in-depth analysis of the alterations present in cancer cells.

However, current RNA-seq tools are commonly complex to use, require powerful computa-tional resources and also need the user to be familiar with programming or other concepts from the field of informatics, which is often not the case for geneticists or molecular biologists. As it is intended to allow experts in the field (who may not have much programming knowledge) to conduct their own analysis through the use of this, and other tools, the creation of a user-friendly web-based application for interacting with the analysis system emerges as an additional objec-tive. The application consists of a web interface that allows access to a distributed system which provides the computational resources necessary for an RNA-seq pipeline and a database where to archive results.

1.2

Motivation and Objectives

All over the world there are millions of people affected by cancer. There are several possible causes for this disease, one of which is a genomic origin. While there is a myriad of alterations in cancer cells that distinguish them from healthy ones, one trait that is common to all cancerous cells is an abundance of aberrant mRNA transcripts. One possible source of this defective mRNA is abnormal gene splicing. Alternative splicing is the regulated process during gene expression that results in a single gene coding for multiple proteins. A deregulation in normal splicing, caused by cancer-specific defects in the splicing mechanisms, can confer new properties of growth, differentiation or other characteristics to the cancer cell [FG08].

With the emergence of high-throughput sequencing there have been substantial advances in cancer genomics research. The goal of this project is the development of a tool, based on RNA-seq, that uses high-throughput information from the transcriptomes of thyroid cancer samples in order to identify events of aberrant splicing and assess their impact in the structure of the transcript. For this purpose, the RNA-seq technology was used for sequencing the genomes. RNA-seq performs the reconstruction of at least part of the genome of an individual through the reading of small fragments, calculates the sets of active and inactive genes and compares it with another reference genome (gene differentiation).

Further extensions to this tool have already been proposed such as focusing the analysis on mi-tochondrial DNA, as this is passed down from generation to generation without any modifications (except for rare mutations), making it more stable over time [ED]. It is also in the researchers’ in-terests to incorporate information relative to events of gene fusion into the analysis to allow further insight into the modifications present in cancer cells.

However, it is common for tools of this type to require powerful resources and be complex to use, especially for molecular biologists or geneticists who might not have a programming back-ground. As such, it is also relevant to create a web application to serve as an interface for a distribute system where an RNA-seq pipeline is deployed.

(19)

Introduction

In a final stage to further assess the developed tool, we have used thyroid cancer data retrieved from I3S/IPATIMUP and also from the ENCODE project [Ins].

1.3

Project

This work focuses on the development of a prototype tool for detecting aberrant alternative splicing events and measure their impact on the organism by assessing the changes to the transcriptome induced by these events. This tool was used on real data provided by I3S/IPATIMUP and the ENCODE project in order to validate it and provide new insights on the origins of cancer. In order to generate more complete results, gene fusion and gene expression analyses were also conducted on the provided samples.

As we intend to enable the usage of RNA-seq tools of this type by researchers and experts in genomics without programming knowledge, the development of this project also encompasses the creation of a web interface for user interaction and a back-end that is capable of handling the analysis.

Analysing the transcriptome is a computationally intensive task. As such, the analysis module should be implemented in such a way that allows it to be separated from the rest of the system as it, ideally, will be running on a dedicated cluster with powerful machines. Our tool exposes a web API that allows executing and monitoring tasks as well as transferring data and results. In this system a pipeline for analysing data via RNA-seq will be deployed.

Another service was created for handling the submission of jobs to the cluster and for receiving and aggregating results, storing them in a persistent database. This service allows the interaction of a user with the system via a web interface so that they can submit jobs and view experiment reports in HTML format.

1.4

Dissertation Structure

In addition to this introductory chapter, the remainder of this report contains five additional chap-ters. In Chapter 2, a brief explanation of some key biological and RNA-seq concepts essential for understanding the subjects approached by this work is provided. Other relevant tools and technologies to be used with this project will also be detailed in this chapter.

In Chapter 3, we present the system proposed to enable molecular biologist and geneticists to make use of an RNA-seq pipeline: the eRAP system. We describe its architecture, detail each component/module’s functions and describe the methodological approach for their implementa-tion. The chapter concludes by characterizing the studies used to validate the tool.

Chapter4describes the types of analyses we aimed to perform on the data provided by IPA-TIMUP: gene fusion, alternative splicing and gene expression. For each analysis, the tools used and the necessary pre-processing steps were described in detail. Additionally, as the Junctions Comparertool was entirely developed by us to perform the desired alternative splicing analysis, the case study used to validate this tool is also described in Section4.2.5.

(20)

Introduction

Chapter5introduces the case study we performed using real world thyroid cancer data pro-vided by IPATIMUP’s researchers. The chapter starts with the description of the samples propro-vided, the researching objectives and the protocol of analysis. The following sections describe the results we obtained with our analysis.

Finally, Chapter6summarizes what has been described in this dissertation and presents the final conclusions. In this last chapter we also point out possibilities for future work and improve-ments/extensions for each of the developed tools.

(21)

Chapter 2

Biological Key Concepts and

Technology Survey

In this chapter we present the basic biological concepts that are necessary to understand the thesis work. We begin by detailing the gene expression process, splicing and alternative splicing. This will be followed by a review of some of the different standard formats, tools and methods available for analysing RNA. We focus on the iRAP pipeline1and also on the state-of-the-art technologies necessary to implement the computational infrastructure and web portal.

2.1

Basic Biological Concepts

2.1.1 Glossary

We begin by listing some key terms and definitions that will be commonly used throughout the remainder of this dissertation:

• DNA - molecule that carries most of the genetic instructions used in the growth, develop-ment, functioning and reproduction of all known living organisms and many viruses. • chromosome - a thread-like structure of nucleic acids and protein found in the nucleus of

most living cells, carrying genetic information in the form of genes. • gene - distinct sequence of DNA that forms part of chromosome.

• RNA - molecule that acts as a messenger carrying instructions from DNA for controlling the synthesis of proteins, although in some viruses RNA rather than DNA carries the genetic information.

(22)

Biological Key Concepts and Technology Survey

• exon - a segment of a DNA or RNA molecule containing information coding for a protein or peptide sequence.

• intron - a segment of a DNA or RNA molecule which does not code for proteins and inter-rupts the sequence of genes.

• genetic code - the means by which DNA and RNA molecules carry genetic information in living cells.

• codons - a sequence of three nucleotides which together form a unit of genetic code in a DNA or RNA molecule.

• aminoacids - a simple organic compound which serves as a building block for proteins. • 5’/3’ - pronounced 5 prime or 3 prime respectively. They indicate the carbon numbers in

the DNA’s sugar backbone. The 5’ carbon has a phosphate group attached to it and the 3’ carbon a hydroxyl group. This asymmetry gives a DNA strand a "direction". For example, DNA polymerase works in a 5’ -> 3’ direction, that is, it adds nucleotides to the 3’ end of the molecule, thus advancing to that direction.

• DNA strands - the term template strand (or sense strand) refers to the sequence of DNA that is copied during the synthesis of mRNA. The opposite strand (that is, the strand with a base sequence directly corresponding to the mRNA sequence) is called the coding strand (or the antisense strand) because the sequence corresponds to the codons that are translated into protein.

• fusion genes - hybrid gene formed from two previously separate genes. 2.1.2 Genetic Processes

As mentioned in Chapter 1, gene expression is the process by which genes are "translated" into functional genetic products, such as RNA or, at a later stage, proteins, which ultimately regulate cellular activity. This process can be divided into two main phases:

• Transcription - The production of mRNA from DNA by the RNA polymerase enzyme, and further processing of the resulting mRNA molecule.

• Translation - The synthesis of proteins directed by mRNA, and subsequent processing of the resulting protein molecule.

It is also worth noting that not all genes are translated into proteins as a final product of gene expression. Some genes code different types of RNA that play a role in the translation phase such as tRNA or rRNA [GENb].

One of the processing phases that mRNA molecules undergo is called splicing. Simplifying the structure of genetic code, it can be divided into exons and introns. These sections of DNA

(23)

Biological Key Concepts and Technology Survey

Figure 2.1: Simplified overview of the gene expression transcription and splicing processes and the structure of the genetic molecules involved

are kept in precursor mRNA (pre-mRNA) at first as it is initially transcribed from DNA. During the splicing phase, the introns are removed and the pre-mRNA is converted into mature mRNA containing only coding regions (exons), ready to be used for synthesizing proteins. This process is illustrated in Figure2.1. During the protein synthesis phase, the RNA codons in the exons are "translated" into amino acids, forming chains that will eventually become proteins. This process, illustrated in Figure2.2starts with a special start codon (AUG) and ends when one of three stop codons is found (UAA, UAG or UGA).

However, it is also possible that during the splicing phase, some exons are also removed. This leads to a single pre-mRNA strand being capable of generating more than one distinct mature mRNA molecules, depending on which exons are kept in the final product. In turn, this implies that a single gene is able to code more than one protein. This process, shown in Figure2.3, is known as alternative splicing.

It is possible for this mechanism to be compromised, inducing aberrant alternative splicing events [FG08]. This will lead to changes in the proteome: some proteins will no longer be synthe-sized or new ones will be added to the proteome. As proteins play a key role in regulating cellular activity, these changes may have a significant impact on the characteristics of the cancerous cells and confer them new and devastating properties such as capabilities of growth, differentiation and even hormone resistance [ASK+01].

2.1.3 Alternative Splicing

Pre-mRNA splicing is a natural source of cancer-causing errors in gene expression. Mutations in intronic splice sites of tumor suppressor genes often cause exon-skipping events which will result in the truncation of the synthesized proteins, just like classical nonsense mutations. It is also worth noting that many studies over the last 20 years have reported cancer-specific alternative splicing

(24)

Biological Key Concepts and Technology Survey

Figure 2.2: Overview of the gene expression translation process and the structure of the genetic molecules involved2

in the absence of genomic mutations [Ven04].

Types of Alternative Splicing

There are five basic modes of alternative splicing which are generally recognized [HKA]. As depicted in Figure2.4, these modes can be classified as:

• Exon skipping: In this case, an exon may be spliced out of the primary transcript or re-tained. This is the most common mode in mammalian pre-mRNAs.

• Mutually exclusive exons: One of two exons is retained in mRNAs after splicing, but not both.

• Alternative donor site: An alternative 5’ splice junction (donor site) is used, changing the 3’ boundary of the upstream exon.

• Alternative acceptor site: An alternative 3’ splice junction (acceptor site) is used, changing the 5’ boundary of the downstream exon.

• Intron retention: A sequence may be spliced out as an intron or simply retained. This is distinguished from exon skipping because the retained sequence is not flanked by introns. If the retained intron is in the coding region, the intron must encode amino acids in frame with the neighboring exons, or a stop codon or a shift in the reading frame will cause the protein to be non-functional. This is the rarest mode in mammals.

2Image taken from: https://en.wikipedia.org/wiki/Translation_(biology)#/media/File:

(25)

Biological Key Concepts and Technology Survey

Figure 2.3: The alternative splicing process. The colored blocks represent exons in the RNA sequence while the lines between them are introns3

Fusion Genes

A fusion gene is a hybrid gene formed from two previously separate genes. As shown in Fig-ure2.5, it can occur as a result of: translocation, interstitial deletion, or chromosomal inversion. Currently, 358 gene fusions involving 337 different genes have been identified. Additionally, a growing number of gene fusion events are being recognized as important parameters for diagnos-ing (and prognosdiagnos-ing) malignant hematological disorders and childhood sarcomas. The analysis shows that gene fusions occur in all malignancies, and that they account for 20% of human cancer morbidity [MJM07].

2.2

RNA Sequencing and Transcriptome Assembly

Understanding the transcriptome (set of all RNA molecules) is essential for interpreting the func-tional elements of the genome, revealing the molecular constituents of cells and tissues, and also for understanding development and disease. Obtaining information of a cell or organism’s tran-scriptome is done experimentally, by employing sequencing techniques. These were initially dom-inated by Sanger’s method, proposed in 1977, and other similar sequencing techniques [FCK02]. In spite of providing accurate results that came to revolutionize the fields of genetics, such meth-ods were notably slow and costly: applying these techniques on large scale projects, such as the Human Genome Project, required the creation of sequencing centers that housed hundreds of DNA sequencing instruments which, in turn, required large amounts of personnel to operate [Sch07]. Due to the severity of the limitations imposed by these methods, this kind of study was mostly

(26)

Biological Key Concepts and Technology Survey

Figure 2.4: Traditional classification of basic types of alternative splicing. In these graphics, exons are represented by boxes and introns by lines. Regions included in the messages by alternative splicing are yellow while constitutive exons are shown in blue4

restricted to model organisms, such as the fruit fly and mouse genomes [Wol13]. It comes as no surprise that new techniques, capable of producing larger amounts of information at a lesser cost and being employed by single laboratories, have been rapidly growing in popularity.

For many years microarrays were the standard tool for quantifying gene expression and ex-amining features of the transcriptome. However, this technique requires previous knowledge of the genome being studied. It also has propensity for introducing some biases in gene expression measurements from cross-hybridization. For these reasons, it is being replaced by a more recent approach called RNA-seq. Compared to microarrays, RNA-seq provides more informative and reliable results and is capable of applying de novo assembly approaches for exploratory analy-sis, discarding the need for a reference genome5. Further, it provides information on RNA splice events which are not readily detected by standard microarrays.

RNA-seq

The human genome contains 3,234.83 Mega-basepairs (Mbp) while that of Escherichia Coli con-tains only 4.6 million basepairs. This translates into roughly 700MB for humans if a perfect representation was possible (using only 2 bits per base). However, in reality, sequencing a genome requires a lot of information besides just the basepairs (such as quality information) and it be-comes impossible to achieve this perfect representation. Sequencing a whole genome with an average x30 coverage (each base was covered, on average, by 30 reads) requires approximately

4Image adapted from: https://commons.wikimedia.org/wiki/File:Alt_splicing_bestiary2.

jpg

5A reference genome is a digital nucleic acid sequence database, assembled by scientists as a representative example

of a species’ set of genes. De novo transcriptome assembly is the method of creating a transcriptome without the aid of a reference genome

(27)

Biological Key Concepts and Technology Survey

Figure 2.5: Illustration of gene fusion events at a chromosomal level7

90GB, mapping one base to one byte. Including quality scores and short reads6, as the standard formats generally do, this can scale up to about 180GB of necessary storage [Rei]. These numbers show the complexity of the data being handled and the need for powerful computational resources and sophisticated algorithms for sequencing complex genomes such as the human genome.

For this thesis, we will focus on the seq approach for analysing the transcriptome. RNA-seq is a recently developed approach to transcriptome profiling that uses deep-RNA-sequencing tech-nologies. It also provides a far more precise measurement of levels of transcripts and their isoforms than other methods. RNA-seq is a sequential process that is still under development [WGS09]. A workflow diagram of common steps in the RNA-seq analysis is depicted in Figure2.6.

From analysing the diagram, several stages of the process can be identified. Initially, short reads are obtained by sequencing cDNA8fragments (reads) obtained from a population of RNA. These reads are typically subjected to a quality control stage in which reads below a certain quality threshold are discarded.

Following quality control, the resulting reads are either aligned to a reference genome or reference transcripts. If no reference genome is provided, a de novo assembly is performed to

6The term ’short read’ came about to describe technologies that produced reads that were substantially shorter

(30-50bp) than the mainstream technologies employed at that point (1000bp)

7Image taken from:https://en.wikipedia.org/wiki/Fusion_gene

8complementary DNA (cDNA) is double-stranded DNA synthesized from a single stranded RNA template in a

(28)

Biological Key Concepts and Technology Survey

Figure 2.6: General RNA-seq workflow diagram

produce a genome-scale transcription map that consists of both the transcriptional structure and level of expression for each gene.

After the alignment stage, a quantification is performed in order to summarize and aggregate the reads over a biologically meaningful unit such as exons, isoforms or genes. By analysing the output of the latter phase it is possible to determine which genes are being expressed through differential expression. In turn, this allows determining which proteins will be synthesized. The analysis can be further extended with other optional steps such as gene set enrichment or splicing detection.

2.3

Relevant Standard File Formats

As there is a multitude of sequencing tools available, it comes as no surprise that there are also several distinct file formats used by these tools. In order to be able to integrate the sequencing tools, it becomes necessary to operate with several of these file types. We now briefly describe the file formats relevant to our work.

FASTA

FASTA is the line and character sequence format used by the National Center for Biotech-nology Information (NCBI), resorting to their own character code conventions. It is a simple

(29)

Biological Key Concepts and Technology Survey

text format, that can be used to store either nucleotide (DNA, RNA) or amino acid (protein) se-quences [NCB]. Due to its simplicity and ease of parsing and manipulating, this format is broadly used to store sequencing reads. It is also an attractive solution for data transfer between different tools as several bioinformatic tools use it. An example can be seen in Listing2.1.

1 >gi|31563518|ref|NP_852610.1| microtubule-associated proteins 1A/1B light chain 3A isoform b [Homo sapiens]

2 MKMRFFSSPCGKAAVDPADRCKEVQQIRDQHPSKIPVIIERYKGEKQLPVLDKTKFLVPDHVNMSELVKI

3 IRRRLQLNPTQAFFLLVNQHSMVSVSTPIADIYEQEKDEDGFLYMVYASQETFGFIRENE

Listing 2.1: A simple example of one sequence in FASTA format. The first line describes the sequence while the following lines contain its data

FASTQ

The FASTQ format has emerged as another common format to store biological sequences. It provides an extension to the FASTA format in that it can also store a numeric quality score associated with each nucleotide in a sequence [CFG+10]. It was originally developed by the Wellcome Trust Sanger Institute to bundle a FASTA sequence and its quality data, but has become the de facto standard for storing the output of high-throughput sequencing instruments. While this format suffered from the lack of a formal definition, the Sanger version of the FASTQ format has found the broadest acceptance. An example of a FASTQ file fragment is shown in Listing2.2.

1 @SEQ_ID

2 GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT

3 +

4 !’’*((((***+))%%%++)(%%%%).1***-+*’’))**55CCF>>>>>>CCCCCCC65

Listing 2.2: A simple example of one sequence in FASTQ format. The first line begins with a ’@’ character and is followed by a sequence identifier and an optional description (like a FASTA title line). Line 2 contains the sequence data. Line 3 starts with a ’+’ sign and an optional identifier and any description. Line 4 contains the quality values for line 2’s sequence.

SAM and BAM

The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences. It supports short and long reads from different sequencing tools and it is flexible, compact, and efficient in random access. The Binary Alignment/Map (BAM) format contains the same information as SAM but in binary format rather than in textual representation. It was designed in order to improve performance [LHW+09]. If present, the header must appear prior to the alignments. Header lines start with ‘@’, while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information such as mapping

(30)

Biological Key Concepts and Technology Survey

position, and variable number of optional fields for flexible or aligner specific information. An example of a SAM file can be seen in Listing2.3.

1 @HD VN:1.5 SO:coordinate

2 @SQ SN:ref LN:45

3 r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *

4 r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *

5 r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;

6 r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *

7 r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;

8 r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1

Listing 2.3: Example of an alignment in SAM format. The first line (’@HD’) is the header line and must contain the format version (VN). The ’@SQ’ lines identifies the reference sequence. Any other lines that don’t begin with ’@’ represent alignments and have 11 mandatory information fields

GFF and GTF

The Generic Feature Format (GFF) is a standard file format for storing genomic features in a text file. GFF files are plain text, 9 column, tab-delimited files and they can also be used for databases with a custom schema [Gena]. GFF has several versions, the most recent of which is GFF3. The GFF format requires fields to be tab-delimited and only the last field of a feature line is allowed to not have a value. This format incorporates fields for identifying the feature, respective chromosome, the program that generated it and other useful information. An example can be seen in Listing2.4.

The General Transfer Format (GTF) is quite similar to GFF (sometimes being referred to as GFF2.5), which it extends by defining additional conventions.

1 X Ensembl Repeat 2419108 2419128 42 . . hid=trf; hstart=1; hend=21

2 X Ensembl Repeat 2419108 2419410 2502 - . hid=AluSx; hstart=1; hend=303

3 X Ensembl Pred.trans. 2416676 2418760 450.19 - 2 genscan=GENSCAN019335

4 X Ensembl Variation 2413425 2413425 . + .

5 X Ensembl Variation 2413805 2413805 . + .

Listing 2.4: Example of GFF output from an Ensembl export. Each line represents a feature and contains 9 fields respectively identifying the chromosome/scaffold, program source, feature type, start and end positions, score, strand, frame and attributes (semi-colon separated list)

(31)

Biological Key Concepts and Technology Survey

BED

BED (Browser Extensible Data) [gro] format provides a flexible way to define the data lines that are displayed in an annotation track. BED lines have three required fields and nine additional optional fields. The number of fields per line must be consistent throughout any single set of data in an annotation track. The order of the optional fields is binding: lower-numbered fields must always be populated if higher-numbered fields are used. An example of a BED file can be seen in Listing2.5.

1 track name=pairedReads description="Clone Paired Reads" useScore=1

2 chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512

3 chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601

Listing 2.5: Example of an annotation track that uses a complete BED definition. The header lines begin with the word "browser" or "track" to aid the browser in the display and interpretation of the lines of BED data. The data lines contain 3 mandatory arguments (the first 3) which identify the chromosome/scaffold, start and end position respectively. The 9 additional fields are optional and identify the following features in this order: the name of the line, score, strand, thick start, thick end, RGB value (for the browser), block count, block sizes and block starts

2.4

Tools for RNA-seq

As aforementioned, there are many tools available for each processing stage of the RNA-seq anal-ysis. In this section we will briefly explain some of the tools which are relevant to this thesis. Many of these are included in the iRAP pipeline described in Section2.4.1.

Bowtie

Bowtie9 is a short read aligner. Bowtie aligns reads with the aid of an index of the refer-ence genome and it can also take advantage of multiple processor cores, if available, to achieve speedups and memory-efficiency. The most recent version of this software is Bowtie 2 which im-proves in speed, sensitivity and accuracy [LS12]. This tool is most effective when aligning short read sequences against large reference genomes.

TopHat

TopHat [TPS09] is a fast splice junction mapper for RNA-Seq reads. It uses Bowtie for align-ing the short reads and then analyzes the mappalign-ing results to identify splice junctions between

(32)

Biological Key Concepts and Technology Survey

exons. It can take advantage of multiple processor cores by using parallel processes. These fea-tures make it much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is enough to process the data from a whole RNA-Seq experiment in under a day on a stan-dard desktop computer.

Cufflinks

Cufflinks [TRG+12] is commonly used together with TopHat and Bowtie as it is part of the Tuxedo Suite tools. It assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. It receives the already aligned RNA-Seq reads as input and assembles the alignments into a set of transcripts. Cufflinks then uses the supporting reads to estimate the relative abundances of the transcripts, taking into account biases in library10 preparation protocols.

HTSeq

HTSeq [APH14] is a Python library to facilitate the rapid development of scripts for process-ing high-throughput sequencprocess-ing data. It provides parsers for many common data formats in these types of projects, as well as classes to represent data (genomic coordinates, sequences, sequencing reads, alignments, gene model information and variant calls) and provides data structures that al-low for querying via genomic coordinates. The HTSeq-count tool was developed with this library for preprocessing RNA-Seq data for differential expression analysis by counting the overlap of reads with genes.

FastQC

FASTQC [A+10] provides a simple way to do quality control checks on raw sequence data coming from high throughput sequencing pipelines. It accepts data in BAM, SAM or FASTQ format and quickly identify possibly problematic areas. It also allows the summarization of the data with graphs and HTML reports.

SAMTools

SAMTools [LHW+09] is a library and software package for parsing and manipulating align-ments in the SAM or BAM formats. It enables converting from different alignment formats, sorting and merging alignments, removing polymerase chain reaction (PCR) duplicates, generat-ing per-position information in the pileup format, callgenerat-ing sgenerat-ingle nucleotide polymorphisms (SNPs)

10The sequencing library is prepared by random fragmentation of the DNA or CDNA sample, followed by 5’ and 3’

adapter ligation. Alternatively, "tagmentation" combines the fragmentation and ligation reactions into a single step that greatly increases the efficiency of the library preparation process. Adapter-ligated fragments are then PCR amplified and gel purified.

(33)

Biological Key Concepts and Technology Survey

and short indel11variants, and showing alignments in a text-based viewer.

BEDTools

BEDTools [QH10] is a collection of utilities for a wide-range of genomics analysis tasks. The most widely-used tools enable genome arithmetic: intersect, merge, count, complement, and shuf-fle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF and the Variant Call Format (VCF). While each individual tool is designed to do a rela-tively simple task (e.g., intersect two interval files), quite sophisticated analyses can be conducted by combining multiple bedtools operations on the UNIX command line.

MATS

The Multivariate Analysis of Transcript Splicing (MATS) tool is used to detect differential al-ternative splicing events from RNA-Seq data between two samples [SPL+14]. From the RNA-Seq data, MATS can automatically detect and analyze alternative splicing events corresponding to all major types of alternative splicing patterns.

deFuse

The deFuse [MHZ+11] software is a computational method for fusion genes discovery in tu-mor RNA-seq data. It uses a dynamic programming-based split read analysis with discordant paired end alignments and does not discard paired end reads that align ambiguously. Additionally, deFuse is not limited to finding gene fusions with boundaries between known exons: it can iden-tify fusion boundaries in the middle of exons or involving intronic or intergenic sequences.

EricScript

EricScript [BPM+12] is a computational framework used for the discovery of gene fusion events in RNA-seq data. Using the EricScript simulator, it is able to generate synthetic gene fusions and it can also determine statistical measures for evaluating gene fusion detection methods’ performance by using the EricScript CalcStats tool. Comparatively to other similar tools it is a lightweight framework with high sensitivity [KVQL16], yielding accurate results with minimal memory and time consumption.

2.4.1 RNA-Seq Pipelines

While RNA-seq has become the technique of choice for whole-transcriptome profiling, applying these techniques requires considerable programming skills and computational resources. It is

(34)

Biological Key Concepts and Technology Survey

Figure 2.7: iRAP pipeline architecture [FPMB14]

necessary to process millions of sequencing reads and, at each step of the processing pipeline, there are multiple tools available each with their own advantages and disadvantages. This makes it common to require a specific combination of tools for each different experiment. However, integrating the desired tools can also be an arduous and lengthy task as the input/output files of each tool may not be compatible.

The iRAP [FPMB14] system (available under the GLPv3 license) is an integrated RNA-seq analysis pipeline that integrates many of the major tools available for each step of the RNA-seq process. It targets a broad audience and allows users with little bioinformatics knowledge to customize the pipeline and apply their tools of choice. However, more advanced users are also able to fully customize the options used by the chosen tools. The iRAP pipeline is depicted in Figure 2.7 as a standard analysis pipeline. Table 2.1 lists the available tools for each of the most common stages of an RNA-seq analysis (iRAP incorporates several other tools for other less common processing steps as well).

The iRAP pipeline accepts FASTQ or BAM files as input with raw sequence reads, a reference genome in FASTA format, gene model annotation in GTF format and a configuration file. iRAP then generates web reports that the user can read to explore the results of the analysis.

Table 2.1: Tools integrated in the iRAP pipeline for each standard processing stage

Stage Tools Supported

Quality Control fastqc, fastx, bowtie1

Mapping tophat1, tophat2, smalt, gsnap, soapsplice, bwa1, bwa2, bowtie1, bowtie2, gem, star, osa, mapsplice

Quantification basic, htseq1, htseq2, cufflinks1, cufflinks2, cufflinks1_nd, cuf-flinks2_nd, scripture, flux_cap, nurd, stringtie, stringtie_nd, rsem, kallisto

Differential Expression cuffdiff1, cuffdiff2, cuffdiff1_nd, cuffdiff2_nd, deseq, edger, voom, deseq2

Gene Set Enrichment piano Gene Fusion Analysis fusionmap

(35)

Biological Key Concepts and Technology Survey

2.5

Data Repositories

In order to evaluate the system to be developed, it is necessary to have access to real data. Apart from the data that will be provided by IPATIMUP12, there are several online repositories available online that provide such data as well.

One such repository that we will resort to for obtaining test data is the ENCODE project13. The goal of ENCODE is to build a comprehensive list of all functional elements in the human genome. This includes elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active. Through their platform it is possible to access and download several datasets that can be used to validate our thesis work.

Other useful repositories and tools with genomic information also exist such as the ENSEMBL project14and the UCSC Genome Browser15. Both of these allow downloading data via FTP. Addi-tionally, the Cancer Genome Atlas (CGA) provides data on cancer genome sequences, alignments and mutations for several cancer samples on the Cancer Genomics Hub16. One more interesting repository is that of the Genotype-Tissue Expression project17, which collects and analyzes multi-ple human tissues from donors and allows downloading sammulti-ples and other data such as RNA-seq results.

2.6

Additional Technologies

While the work to be done in this thesis focuses mainly on extending the RNA-seq analysis to detect and assess aberrant alternative splicing events, it also encompasses the creation of web services and a website to allow users to run the analysis, maintaining a database with the relevant results. To this end several other technologies are necessary.

AWK

Awk [AKW78] is a unix command encorporating a powerful "programming language" for file analysis and filtering. Its basic operation is to search a set of files for patterns, and to perform specified actions upon lines or fields of lines which contain instances of those patterns. Awk patterns may include arbitrary boolean combinations of regular expressions and of relational op-erators on strings, numbers, fields, variables, and array elements. Actions may include the same pattern-matching constructions as in patterns, as well as arithmetic and string expressions and assignments, if-else, while, for statements, and multiple output streams.

12https://www.ipatimup.pt/Site/ 13https://www.encodeproject.org/ 14http://www.ensembl.org/index.html 15https://genome.ucsc.edu/ 16https://cghub.ucsc.edu/ 17http://www.gtexportal.org/home/

(36)

Biological Key Concepts and Technology Survey

Awk makes certain data selection and transformation operations easy to express, making it an ideal utility for quickly filtering and processing text-based files such as the ones commonly used in the bioinformatics field for describing genomes or sequencing reads.

Python

For the development of the system we decided to use the Python18 language. It is a widely used general-purpose, high-level programming language that emphasizes code readability. Many Python libraries are freely available containing helpful modules for tasks such as web development or even biological computations (such as HTSeq or Biopython19).

Django

Django20 is a high-level Python Web framework. The framework includes features such as au-thentication, URL routing, a templating system, an object-relational mapper (ORM), and database schema migrations. It follows the Model-View-Controller architectural pattern and has been used in several popular sites such as Pinterest21and Instagram22.

There are alternatives to Django such as Flask, Pylons, Grok and TurboGears. Django was chosen as it is a very mature framework with sane defaults which allows developers to dive in to web applications quickly without needing to make a lot of decisions about their application’s infrastructure ahead of time [Bro15]. This makes it ideal for mid-size and large projects. It is also the framework with the largest and most active community.

Databases

For the database it was decided to use a non-relational database as we will be storing massive volumes of data with a possibly different format between experiments. As such, this type of databases provide a flexibility that their relational counterparts do not. It might also become relevant at some point to store the data in a distributed manner across several machines, which is better handled by non-relational databases [LM13].

The non-relational database we opted to use was MongoDB23. It is an open source, document-oriented database designed with both scalability and developer agility in mind. Instead of storing your data in tables and rows as you would with a relational database, in MongoDB you store JSON-like documents with dynamic schemas. Additionally, MongoDB provides the GridFS system24 which is a specification for storing and retrieving files that exceed the BSON-document size limit of 16 MB. Instead of storing a file in a single document, GridFS divides the file into parts, or

18https://www.python.org/ 19http://biopython.org/wiki/Main_Page 20https://www.djangoproject.com/ 21https://pinterest.com 22https://www.instagram.com/ 23https://www.mongodb.org/ 24https://docs.mongodb.com/manual/core/gridfs/

(37)

Biological Key Concepts and Technology Survey

chunks, and stores each chunk as a separate document. GridFS is useful not only for storing files that exceed 16 MB but also for storing any files for which you want access without having to load the entire file into memory, making it useful for the project as the size of both input and output files for RNA-seq analysis tends to be in the order of a few gigabytes.

Web Services

A Web service is a service offered by an electronic device to another electronic device where com-munication takes place over the World Wide Web (WWW)25. Generally, a web service provides an object-oriented web based interface to a database server, which is utilized for example by another web server or mobile application that provide a user interface to the end user. It is also common for web servers to consume several web services at once on different machines in order to compile the content into a single user interface.

For implementing the web services we resorted to Python and Django, using the latter’s tem-plating language to build the website. To simplify the development of these modules, the Py-Charm26integrated development environment was used.

2.7

Chapter Conclusions

In this chapter we provided an overview of the key molecular biology concepts that are essential for the understanding of the thesis work. We also detailed the RNA-seq process and presented short analyses of concrete tools and standard file formats that are relevant to the project. Finally, we also gave a brief introduction to the other technologies used for the project.

25https://www.w3.org/TR/2004/NOTE-ws-gloss-20040211/#webservice 26https://www.jetbrains.com/pycharm/

(38)
(39)

Chapter 3

The eRAP Framework

Let us recall that one of the problems we propose to solve is the development of a tool that helps experts to run RNA-seq analysis without requiring any programming knowledge or familiarity with a command-line interface. This tool is able to use powerful computational resources, it is based on an existing RNA-seq pipeline (iRAP) and has a user-friendly web interface. In this chapter we first introduce the architecture of our framework, the Easy RNA-seq Analysis Pipeline, and then detail each of its components as well as the system’s usage.

3.1

eRAP

Due to the complexity of configuring an iRAP experiment, a user friendly web interface was designed to allow the creation of said experiments and the retrieval of results. The experiment creation form requires as input the main settings necessary to create an iRAP experiment such as the experiment’s (unique) name, species name and reads in FASTA or FASTQ formats. It is not limited to any specific species as it allows users to specify any reference files (annotation in GTF format and reference genome in FASTA format) by indicating the URLs where they are available. The reference files are then retrieved from the specified URLs and kept locally to allow re-use across experiments and avoid wasting time and storage. However, as there are dozens of possible tools to use, the configuration options are also very vast and many have not been implemented on the interface at this stage. These additional options should be specified on the configuration file1 which should be submitted when creating an experiment.

After creating an experiment, it becomes available on the global experiment list page as well as on the owner’s list of experiments. On these pages the status of the experiments is available and further details can be seen on each experiment’s specific page. There the results, output and error logs will also be available once an experiment finishes running.

(40)

The eRAP Framework

Figure 3.1: System Architecture

3.2

System Architecture

The system we implemented to answer the aforementioned problem has its foundation on the iRAP system. As depicted in Figure3.1, there is an independent module responsible for the RNA-seq analysis, which will be a deployment of iRAP. As this component will be responsible for the computationally intensive operations, it will be deployed in a dedicated cluster with sophisticated machines. In this subsystem is included an additional back-end server to allow interaction with the front-end server and the database and it will be through this server that tasks will be created for iRAP.

In order to allow user interaction and for aggregating experiment data, the back-end module will expose an API, via Python with Django, to allow submitting and monitoring tasks. Addition-ally, it will resort to parallelization by maintaining a pool of worker processes in order to allow running multiple tasks at once. Once it receives a request to start an experiment, it will retrieve that experiment’s files, which are stored in the database, and the experiment process will be added to the queue and remain there until a worker process is free and can start processing the experiment. The results will then be processed and stored in a MongoDB database which will be accessed whenever the status of the experiment is updated or results are finished. The output and error logs of the experiment’s execution will also be stored in the database and made available to the users. The front-end server will host a web application in order to provide a user friendly interface and allow researchers without extensive programming knowledge to use the system to run their experiments and analyse the results.

Thanks to the decoupling of website, storage and analysis module, it will be simple to change the configuration of the computational resources at any time.

(41)

The eRAP Framework

Table 3.1: MongoDB Collections

Stage Tools Supported

auth_user User account data

django_session Session data for active users

storage.chunks GridFS field where files’ data is stored storage.files GridFS field where files’ metadata is stored user_server_experiments Experiment data

user_server_gtffile URL and GridFS object for GTF files

user_server_refgenome URL and GridFS object for reference genome files

3.2.1 Data Scheme

MongoDB is a NoSQL, document based, database. The information is organized in collections in which each element is stored as a JSON document. MongoDB is also capable of storing and serving very large files without having to load them entirely into memory via the GridFS system, which splits files into chunks and saves each one as a separate document. Table 3.1 shows the existing collections used for storing the system’s data.

Auth User

An Auth User’s document stores all relevant information about a user: a unique ID, username, first and last names, email, password, type of user and other information such as the last login and the join date. This collection is created by the non-relational Django authentication back-end but is easily extendable by developers if necessary. Listing3.1 shows an example of a document in this collection. 1 { 2 "_id" : ObjectId("5739ee7f8bc3b46ad3fab34a"), 3 "username" : "luiscleto", 4 "first_name" : "Luis", 5 "last_name" : "Cleto", 6 "is_active" : true, 7 "email" : "luisccleto@gmail.com", 8 "is_superuser" : true, 9 "is_staff" : true, 10 "last_login" : ISODate("2016-05-29T20:15:32.397Z"), 11 "password" : "pbkdf2_sha256$12000$ofxfyaH1uA7S$5TMWvOlCwIGVh4sviT2+ XBz7thyKkZKQbVi/gOHYl+k=", 12 "date_joined" : ISODate("2016-05-16T15:59:59.765Z") 13 }

(42)

The eRAP Framework

Django Session

In addition to an unique ID, This collection stores information about users’ sessions such as the expiration date and encrypted session data. It is automatically created by Django’s authentica-tion back-end. An example of a Django session document can be seen in Listing3.2.

1 { 2 "_id" : "wh2bvr4cehb4l5lgm8hteousfdotyrlj", 3 "expire_date" : ISODate("2016-06-06T14:01:40.873Z"), 4 "session_data" : " OWQ1NGFlZmJhMzQxNTVjZjcyYTdiYWUwZGU5MzJmYjE2ODBiN2VmNjp7fQ==" 5 }

Listing 3.2: Example of a Django Session document in JSON format Storange Chunks and Storage Files

Storage Chunks only stores binary data, the sequence number of the file’s chunk and the ID of the corresponding file. Storage files on the other hand stores all relevant metadata about a file such as file size, chunk size, file name, upload date and an MD5 checksum hash2. These collec-tions are automatically created by the GridFS system and their documentation is openly available3.

User Server Experiments

This collection stores all data about an experiment such as its name and files. Additionally, it also maintains a status value which indicated the rate of completion of the experiment (a negative value indicates an error). An example of an experiment document is shown in Listing3.3.

1 { 2 "_id" : ObjectId("574b664f8bc3b44e50b614c1"), 3 "status" : 100, 4 "libraries_file" : "/ecoli_libraries.zip", 5 "date_modified" : ISODate("2016-05-29T21:59:34.690Z"), 6 "description" : "testing", 7 "title" : "ecoli_ex", 8 "gtf_file_id" : ObjectId("574b66468bc3b44e50b61017"), 9 "reference_genome_id" : ObjectId("574b66468bc3b44e50b61016"), 10 "author" : "luiscleto", 11 "conf_file" : "/ecoli_example.conf", 12 "err_log" : "/ecoli_ex.err_IKB20dS.log",

2The MD5 algorithm is a widely used hash function producing a 128-bit hash value. It is often used as a checksum

to verify data integrity, but only against unintentional corruption.

(43)

The eRAP Framework 13 "fail_message" : "", 14 "results_archive" : "/report.zip", 15 "out_log" : "/ecoli_ex.out_Ep8yHb8.log", 16 "date_created" : ISODate("2016-05-29T21:59:34.690Z"), 17 "species" : "ecoli_k12" 18 }

Listing 3.3: Example of an Experiment document in JSON format User Server GTFFile and RefGenome

Both collections have the same data scheme and contain a file’s URL and content. The file_content field indicates the respective GridFS field if the file has already been retrieved and stored. Listing3.4shows an example of a GTFFile document in JSON format.

1 { 2 "_id" : ObjectId("574b66468bc3b44e50b61017"), 3 "file_content" : "/0f2cecc6-1625-4919-aefe-59a733d8dd7f.gz", 4 "date_created" : ISODate("2016-05-29T21:59:34.687Z"), 5 "file_address" : "ftp://ftp.ensemblgenomes.org/pub/release-19/bacteria//gtf /bacteria_22_collection/escherichia_coli_str_k_12_substr_mg1655/ Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.1.19.gtf.gz" 6 }

Listing 3.4: Example of a GTF File document in JSON format

3.3

Use Cases

The eRAP system was designed with the researcher in mind as the end user. As Figure3.2shows, a user is able to perform basic authentication tasks such as registering or logging in or out of his account. Additionally, they can create an experiment and also view that experiment’s status and retrieve any input or output files.

3.4

Web Interface

The implementation of the front-end Web Interface module was done using Python with Django and the Jinja24 templating language. Additionally, the Bootstrap framework5 was used to sim-plify designing the application. In order for the server to be deployed successfully, the python requirements in the requirements.txt file provided with the source code must be installed. Addi-tionally, the non-rel version of django should be installed from the django-nonrel project. Finally,

4http://jinja.pocoo.org/docs/dev/ 5http://getbootstrap.com/

(44)

The eRAP Framework

Figure 3.2: System Use Cases

before running the server, a local_settings.py file should be created in the irap_user_server direc-tory which declares the variables shown in Table 3.2. Note that the last three variables are only relevant for the Analysis server and are not used for the Web Interface server.

Table 3.2: Variables to be declared in the local_settings.py file

Variable Description

EMAIL_HOST_USER email address to be used for SMTP authentication EMAIL_HOST_PASSWORD password to be used for SMTP authentication IRAP_SERVER_ADDRESS address of the back-end (analysis) server FRONT_END_SERVER_ADDRESS address of the front-end (web interface) server MAX_NUMBER_OF_PROCESSES size of back-end server’s worker process pool

IRAP_DIR work directory of the iRAP program, should be the same as the $IRAP_DIR environment variable

(45)

The eRAP Framework

Once a user accesses the website, he is greeted with the main page (Figure3.3) from where he has the option to perform account specific actions (register, login or logout) through menus in the top bar, or browse or create experiments through the menus in the left bar.

Figure 3.3: Landing page of the UI module

On the experiments index, a user can either see another user’s or all experiments that are on the system. On this page the title of the experiment is shown along with a progress and status indicator, as shown on Figure3.4.

Figure 3.4: Experiments index of the UI module

Once a user selects an experiment, they will be shown that experiment’s page, where all details and files are available. On this page a user can retrieve the execution logs for the experiment as well as the results in case any are available. On the Progress section, an estimate of completion is provided (or an error message if the experiment failed). Figure3.5shows both of these scenarios.

(46)

The eRAP Framework

Figure 3.5: Experiment pages for a successful and a failed experiment

Finally, a user can also create a new experiment through the experiment creation form. As shown in Figure3.6, this form requires a unique title for the experiment as well as a description. Additionally, a user must specify the reference files (GTF and annotation file) and provide both a configuration file for additional options and an archive containing the reads to be processed. Once an experiment is created, the front-end server will send a request to the Analysis module to start processing the requested experiment.

3.5

Analysis Server

Similarly to the front-end server, this module was developed using Python with Django. As with the previous ones, requirements should be installed and a local_settings.py file needs to be created to define the variables described in Table 3.2. This module should be deployed on the same physical machine as the iRAP system and the database specified in the standard settings should be the same as the one used by the front-end server.

Once the analysis server receives a request from the front-end server to start processing an ex-periment, it will begin retrieving that experiment’s files and setup the experiment to allow running iRAP. It resorts to Python’s multiprocessing module to maintain a pool of workers, along with a queue of experiments to execute, and uses the subprocess module to execute iRAP and analyse

(47)

The eRAP Framework

Figure 3.6: Experiment creation page

its output. As each step is completed, the status of the experiment is updated in the database. Once iRAP finishes executing, the results (if they were successfully generated) are compressed and stored in the database to allow access via the front-end server.

3.6

Case Studies

In order to validate the system’s functionality as well as the iRAP installation, several experiments were created using reference data provided by the ENSEMBL project and raw reads from the European Bioinformatics Institute. In some of the experiments, errors were included in the input files to test the system’s reliability.

3.6.1 Standard Experiment

Using the aforementioned input files together with a valid configuration file6, the system was able to setup the experiment and run the full RNA-seq analysis successfully. At each step the experiment’s status was updated so the progress could be tracked while it was ongoing. All the input and output files were stored in the database to allow retrieval by the end users.

(48)

The eRAP Framework

3.6.2 Experiment With Errors

In addition to the previous experiment, a similar one was conducted but errors were introduced in the configuration file so that it would be invalid and iRAP would fail when executing the analysis. In this scenario, the system was able to detect the failure and provide a generic error message relative to the cause of the error (as seen in the second scenario of Figure 3.5). More details on errors during iRAP execution can be seen by analysing the provided output and error logs.

3.7

Chapter Conclusions

In this chapter we go over the architecture of the implemented system and detail the functions of each component/module. The main modules were implemented in Python and Django, using MongoDB for storing data with the GridFS system. The essentials were setup so that an expert can run RNA-seq experiments of any kind even though some options are currently only available via the text-based configuration file. Additionally, thanks to the decoupling of the interface, storage and core logic of the proposed system, it is simple to change the configuration of the computational resources at any time.

Referências

Documentos relacionados

As novas crenças religiosas que emergiram no final da V dinastia tinham a consolidação do papel do deus Osíris como uma das suas principais características

Esse modelo índio /caboclo que conforma a representação da identidade amazônica parece ser um esquema simbólico ontológico no imaginário social da região, de tal modo que

Provar que a interseção de todos os subcorpos de E que são corpos de decomposição para f sobre F é um corpo de raízes fe

No presente trabalho, o tratamen- to que recebeu uma carpa 15 dias ap6s a emergên- cia de sorgo, apesar de ter resultado em razoável peso de matéria seca das ervas daninhas por

Neste trabalho o objetivo central foi a ampliação e adequação do procedimento e programa computacional baseado no programa comercial MSC.PATRAN, para a geração automática de modelos

In addition, decompositions of the racial hourly earnings gap in 2010 using the aggregated classification provided by the 2010 Census with 8 groups, instead of the one adopted in

The probability of attending school four our group of interest in this region increased by 6.5 percentage points after the expansion of the Bolsa Família program in 2007 and

A revista Ultimato deixa em dúvida sobre sua identidade visual, já que não há uma unidade visual na diagramação da revista. Pode-se perceber que há cinco unidades visuais