
Computational methods for the identification of transcriptional regulation modules


Academic year: 2021


Universidade Federal de Pernambuco, Centro de Informática. Graduate Programme in Computer Science. Computational Methods for the Identification of Transcriptional Regulation Modules. Paulo Gustavo Soares da Fonseca. Doctoral thesis. Recife, February 2008.


Universidade Federal de Pernambuco, Centro de Informática. Paulo Gustavo Soares da Fonseca. Computational Methods for the Identification of Transcriptional Regulation Modules. Work presented to the Graduate Programme in Computer Science of the Centro de Informática, Universidade Federal de Pernambuco, in partial fulfilment of the requirements for the degree of Doctor in Computer Science. Advisor: Katia Silva Guimarães (CIn, UFPE). Co-advisor: Marie-France Sagot (LBBE, UCB Lyon I). Recife, February 2008.

Fonseca, Paulo Gustavo Soares da. Computational methods for the identification of transcriptional regulation modules / Paulo Gustavo Soares da Fonseca. – Recife: The Author, 2008. xxii, 164 p.: ill., fig., tab. Thesis (doctorate) – Universidade Federal de Pernambuco. CIn. Computer Science, 2008. Includes bibliography and appendices. 1. Pattern recognition. 2. Computational biology - Methods. 3. Bioinformatics. I. Title. 006.4. CDD (22nd ed.). MEI2008-072.

Acknowledgements

This may be my last opportunity to write acknowledgements as a student. I would therefore like to take the occasion to begin with a broad tribute to all my teachers, from kindergarten in my home town, through the schools I attended in Recife, to my language teachers and my professors at UFPE.

Fortunately, I was able to share this doctoral period with people who made it even more interesting. Besides the lecturers of the courses I took, I had several colleagues from whom I also learned a great deal and whose company I enjoyed. I thank my colleagues from the doctoral programme in Computational Mathematics at CCEN for the entertaining study sessions, my colleagues at CIn for the time shared in our relaxed (and sometimes hot) office and, in particular, my friends Lauro Lins, a great study companion, and Adalberto Farias, whose help I could always count on.

In the midst of this undertaking, I lived an adventure of its own: moving to another country. Fate decided that this country would be France. All things weighed and measured, I consider that I owe a debt of gratitude to fate... and to Jeane as well, who was there before me and from whom I inherited some furniture.

I would therefore like to thank all the members of the LBBE for their welcome and, in particular, my colleagues of the Baobab team, to whom I apologise for not naming each of you, given the risk of regrettable omissions, since you are so many. Nevertheless, I would like to ask permission to greet in particular two colleagues from the lab who became good friends (as did their families): Patricia Thebault and Augusto Vellozo. Once again, thank you all.

In an environment still dominated by men, I had the good fortune of being supervised by two women of great merit and of complementary temperaments and virtues. I thank Prof. Katia Guimarães for introducing me to the field of Bioinformatics and for accompanying me with her guidance and support throughout all these years. To Prof. Marie-France Sagot, I am immensely grateful, not only for her invaluable scientific guidance, but also for the enormous human support without which it would have been very difficult to overcome the moments of uncertainty.

I would also like to thank the examiners of this thesis: Profs. Paulo Adeodato and Sílvio Melo, who were also examiners of my qualifying monograph and thesis proposal; Prof. Marcelo Zaldini, who had also kindly taken part in the committee for my thesis proposal; and, finally, Profs. Ronaldo Hashimoto and Aluízio Araújo. Thank you all for the care taken in reviewing the manuscript(s) and for the valuable criticism and suggestions.

In a project of this size, the support of one's family is fundamental, which is why I would like to express my deep gratitude to my parents for the unconditional support they have always given me, not only during this doctorate but throughout my whole life. I would likewise like to thank my wife and my son, Grácia and Pedro, who were with me at every moment, the good and the less good, and who knew so well how to encourage me with affection and dedication. No thesis could ever do justice to the love you have given me, nor repay the moments in which I deprived you of my company. Even so, I dedicate this one to you.

Last but not least, I would like to thank Fundação CAPES (Brazil) and the Agence Nationale de la Recherche (France) for funding my activities.

Resumo

Recent studies have shown that biological networks display non-random characteristics, among which we highlight modular architecture. In this work, we are interested in the modular organisation of transcriptional regulation networks (TRNs), which model the interactions between genes and the proteins that control their expression at the transcriptional level. Understanding the mechanisms of transcriptional regulation is crucial to explaining the morphological and functional diversity of cells. We set out to address the problem of identifying transcriptional regulatory modules, i.e. groups of co-regulated genes and their regulators, with emphasis on the computational aspect. An important distinction of this work is that we are also interested in studying the evolutionary aspect of transcriptional modules. From the biological point of view, the proposed approach rests on three main premises: (i) co-regulated genes are controlled by common regulatory proteins (transcription factors, TFs) and should therefore present common sequence patterns (motifs) in their regulatory regions, which correspond to the binding sites of those TFs, (ii) co-regulated genes respond in a coordinated way to certain environmental and developmental conditions and should therefore be co-expressed under those conditions, and (iii) since transcriptional modules are presumably responsible for important biological functions, they are subject to greater selective pressure and, consequently, should be evolutionarily conserved.

We therefore define a transcriptional regulatory metamodule (TRMM) as a group of genes sharing motifs and exhibiting coherent expression behaviour in specific contexts, consistently across several species. We propose probabilistic models to describe the modular behaviour in terms of the sharing of regulatory elements (motifs), of co-expression, and of the evolutionary conservation of functional associations between genes, based on diverse data such as sequence, expression, and phylogenetic data.

Keywords: transcriptional regulatory metamodules, motif sharing, co-expression analysis, evolutionary models of transcriptional modules, pattern recognition, probabilistic models, Markov models, p-value computation, biclustering, phylogenetic algorithms.


Abstract

Recent studies have demonstrated that biological networks display non-random characteristics, among which we highlight the modular architecture. In this thesis, we are interested in the modular organisation of transcriptional regulation networks (TRNs), which model the interactions between genes and the proteins that control their expression at the transcriptional level. Understanding the mechanisms of transcriptional regulation is crucial to explaining the morphological and functional diversity of cells. We propose to address the problem of identifying transcriptional regulation modules, i.e. groups of co-regulated genes and their regulators, with emphasis on the computational aspect. One important distinction of our work is that we are also interested in the evolutionary aspect of transcriptional modules. From the biological point of view, the proposed approach is supported by three main premises: (i) co-regulated genes are bound by common regulatory proteins (transcription factors, TFs) and so they must present common sequence patterns (motifs) in their regulatory regions, which correspond to the binding sites of those TFs, (ii) co-regulated genes respond in a coordinated way to certain environmental or growth conditions, and so they must be co-expressed under those conditions, and (iii) since transcriptional modules are supposedly responsible for important biological functions, they are more subject to selective pressure and therefore they must be evolutionarily conserved.

We thus define a transcriptional regulation metamodule (TRMM) as a group of genes sharing regulatory motifs and displaying coherent context-specific expression behaviour consistently across species. We propose probabilistic models to describe the modular behaviour in terms of the sharing of regulatory elements (motifs), of co-expression, and of the evolutionary conservation of functional associations between genes, based on diverse data such as genomic sequence, gene expression, and phylogenetic data.

Keywords: transcriptional regulation metamodules, motif sharing, co-expression analysis, evolutionary models of transcriptional modules, pattern recognition, probabilistic models, Markov models, p-value computation, biclustering, phylogenetic algorithms.


Contents

1 Introduction
    Overview of this text
2 Background and problem description
    2.1 A Biology primer
        2.1.1 Molecules of life: nucleic acids and proteins
        2.1.2 Gene expression and regulation
        2.1.3 Natural selection
    2.2 Biological data acquisition
        2.2.1 DNA sequencing
        2.2.2 DNA hybridisation arrays
    2.3 Biological networks
    2.4 Identifying transcriptional regulation modules: problem overview
3 Related work
    3.1 Motif finding
        3.1.1 EM and Gibbs sampling motif finding
        3.1.2 Multispecies motif finding
    3.2 Gene co-expression analysis
        3.2.1 Clustering expression data
        3.2.2 Biclustering expression data
    3.3 Combining expression and sequence data
    3.4 Multispecies gene co-expression analysis
4 Methods and results
    4.1 Sharing of regulatory elements
        4.1.1 Measuring the motif-sharing property
        4.1.2 The choice of the motif model
        4.1.3 Representing a set of motif occurrences

        4.1.4 Counting subword occurrences over the extended trie
        4.1.5 Inferring motifs using a trie-based K-order Markov model
        4.1.6 Computing p-values
    4.2 Gene expression coherence
        4.2.1 Model-based biclustering
    4.3 Module evolution
5 Discussion and future perspectives
    5.1 Motif sharing analysis
    5.2 Gene co-expression analysis
    5.3 Evolutionary conservation analysis
    5.4 Concluding remarks
A Mathematical background
    A.1 Statistical hypothesis testing
    A.2 Expectation maximisation
    A.3 Markov chains
        A.3.1 MCMC: Metropolis-Hastings and Gibbs sampling
        A.3.2 Continuous-time Markov chains
    A.4 Hidden Markov models
    A.5 Bayesian networks
    A.6 Probabilistic relational models

List of Figures

1.1 TRN modularity
2.1 The deoxyribonucleic acid (DNA)
2.2 Formation of a peptide bond
2.3 Protein synthesis
2.4 The genetic code
2.5 The Central dogma of molecular Biology
2.6 Layout of a typical prokaryotic operon
2.7 Layout of a typical eukaryotic core promoter
2.8 Regulators of eukaryotic transcription initiation
2.9 A species tree
2.10 Chain termination sequencing method
2.11 Expression profile matrix
2.12 TRMM properties
3.1 Input of the EMnEM algorithm
3.2 Input of the PhyME algorithm
3.3 Context-specific regulation
3.4 Procedural two-phase method
3.5 Probabilistic relational model for gene expression
3.6 The cMonkey Algorithm
3.7 Obtaining a refined homologue module
4.1 P-value of a discrete random variable
4.2 Motif sharing
4.3 Comparison of the mean 10-fold cross-validation log-likelihood for 56 motifs
4.4 Comparing likelihoods from distinct distributions
4.5 Example of a trie
4.6 Example of an extended trie
4.7 Context-specific expression module

4.8 Data matrix partition
4.9 Results of the bicluster Gibbs sampler with synthetic data
4.10 Results of the bicluster Gibbs sampler with the Yeast galactose data
4.11 Mapping the entire set of genes into one single vector
4.12 Homology graph for three species
4.13 Human, mouse and yeast species tree
A.1 Example of a Bayesian network
A.2 Example of a PRM
A.3 Example of a ground Bayesian network

List of Tables

3.1 Classification of clustering methods
4.1 Comparison of sites p-values for 38 K-order Markov motifs
4.2 P-value computation times


List of Algorithms

3.1 The Gibbs motif sampler algorithm
4.2 Counting the matches of a pattern with wildcards in a trie
4.3 The Gibbs motif sampler for trie-based K-order Markov model
4.4 Branch and bound motif p-value computation
4.5 Dynamic programming computation of the maximum suffix log-score
4.6 p-value iterative refinement
4.7 Maximum rounding error for K-order Markov models
4.8 Log-likelihood score distribution
4.9 Log-likelihood threshold for a given p-value
4.10 Bicluster Gibbs sampling
4.11 Felsenstein’s tree-likelihood algorithm
A.12 The Expectation Maximisation Algorithm
A.13 The Metropolis-Hastings sampling algorithm
A.14 The Gibbs sampler algorithm
A.15 The Viterbi algorithm
A.16 The Baum-Welch algorithm


Notation

In the table below, we summarise the typographical conventions used throughout the text.

General
Type of expression | Formatting convention | Example
Numerical constant | Uppercase latin | $N = 10$
Numerical variable/index | Lowercase latin | $i = 1, \ldots, N$
Numerical array | Bold lowercase latin | $\mathbf{v} = (v_1, \ldots, v_N)$
Set | Calligraphic uppercase latin | $\mathcal{G} = \{g_1, \ldots, g_S\}$

Strings
Type of expression | Formatting convention | Example
Character constant | Monospaced latin | $\mathtt{A}$, $\mathtt{c}$
Character variable | Lowercase latin | $a = \mathtt{a}$
String of characters | Bold lowercase latin | $\mathbf{x} = x_0 x_1 \cdots x_N$

Probabilistic Models
Type of expression | Formatting convention | Example
Random variable | Uppercase latin | $X \sim \mathcal{N}(0, 1)$
Random variable valuation | Lowercase latin | $X = x$
Random vector | Bold uppercase latin | $\mathbf{X} = (X_1, \ldots, X_N)$
Random vector valuation | Bold lowercase latin | $\mathbf{X} = \mathbf{x}$
Parameter | Lowercase greek | $P(x; \theta)$
Parameter vector | Bold lowercase greek | $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_N)$
Parameter space | Uppercase greek | $\theta \in \Theta$

Besides these general typographic conventions, we also employ the following specific notation:

Notation | Meaning
$\#[X]$ | Cardinality of the set or event $X$
$\Pr[X]$ | Probability of the set or event $X$
$\bar{\mathbf{x}}$ | The mean of vector $\mathbf{x}$
$\mathrm{var}(\mathbf{x})$ | The variance of vector $\mathbf{x}$
$E(X)$ | The expected value of the r.v. $X$

Note also that we sometimes use vectors instead of sets (or multisets) to represent collections of objects. We chose the vector format because it is convenient to think of those sets as ordered (for example, one can unambiguously use genes and conditions as indices of arrays) and because this is closer to the computer representation. Nonetheless, when those vectors appear in contexts in which they should be interpreted as sets, this can usually be done with no risk of confusion. Moreover, given a vector $\mathbf{v} = (v_1, \ldots, v_N)$, we abuse the notation and write $\mathbf{v}' \subset \mathbf{v}$ even though '$\subset$' usually denotes the subset relation. In this case, we mean that $\mathbf{v}' = (v_{i_1}, \ldots, v_{i_L})$ for some $1 \le i_1 < i_2 < \cdots < i_L \le N$.

List of Abbreviations

For convenience, we list some abbreviations frequently used in the text along with their meaning.

Abbreviation | Meaning
BIC | Bayesian information criterion
BN | Bayesian network
ChIP | Chromatin immunoprecipitation
c.p.d. | Conditional probability distribution
CTMC | Continuous-time Markov chain
DNA | Deoxyribonucleic acid
EM | Expectation maximisation
GTF | General transcription factor
HMM | Hidden Markov model
i.i.d. | Independent and identically distributed
MAP | Maximum a posteriori
MC | Markov chain
MCMC | Markov chain Monte Carlo
ML | Maximum likelihood
mRNA | Messenger RNA
PCR | Polymerase chain reaction
p.d.f. | Probability density function
PIC | Preinitiation complex
p.m.f. | Probability mass function
PRM | Probabilistic relational model
PSSM | Position-specific score matrix
PWM | Position weight matrix
RNA | Ribonucleic acid
RNAPol | RNA Polymerase

r.v. | Random variable
TF | Transcription factor
TFBS | Transcription factor binding site
TRM | Transcriptional regulation module
TRMM | Transcriptional regulation metamodule
TRN | Transcriptional regulation network
tRNA | Transfer RNA
w.l.o.g. | Without loss of generality

CHAPTER 1. Introduction

In recent years, biotechnological advances have made possible the production, at an unprecedented rate, of massive amounts of data about the cells of diverse tissues and organisms. These data contain information about (i) the chemical structure and composition of DNA and RNA, the organic macromolecules responsible for encoding and transmitting the hereditary characteristics of living beings, (ii) the composition and structure of several families of proteins, which are the basic constituents of living organisms, (iii) the level of activity of the genes, which are regions of the DNA from which proteins are produced, (iv) the biochemical reactions that occur inside cells, etc. Acquiring these data is, by itself, a great scientific challenge and has been, to a great extent, the focus of research over the last decades. The success achieved in this direction, attested, for example, by the complete sequencing of the human genome, rests fundamentally on collaborative work involving several fields of knowledge, including Biology, Chemistry, Physics, Mathematics, Statistics, and Computer Science. Greater still is the challenge of interpreting these data in order to transform this enormous amount of information into useful knowledge. Here, as before, interdisciplinary efforts are not only necessary but urgent. Biological research has traditionally dealt with isolating components of biological systems and studying their individual characteristics and functions.
Given that biological systems, from simple bacterial cells to complete ecosystems, are made up of several components that interact in a complex and orchestrated way, we can only expect to fully understand the biological phenomena sustaining life if we shift from this parts-list identification paradigm to a more integrative approach, in which we are also interested in the associations and interactions between the system components as well as in the dynamics of those interactions. This is the essence of the emerging field of Systems Biology (Kitano, 2002a,b), which proposes to explain biological systems from a holistic, system-level perspective through systematic (computational) analysis of heterogeneous experimental data. Long-term objectives are even more ambitious: eventually, we would like to be able to build predictive models which would let us anticipate future states of the modelled systems through computational simulation. This would have profound impacts in areas such as medicine and agriculture.

An example of the need for a more system-oriented analysis arises when one wants to explain the complexity of organisms from their genomes. For instance, the genome of the fruit fly Drosophila melanogaster has fewer than 14,000 genes, against the approximately 20,000 genes of the worm Caenorhabditis elegans (a 1 mm soil nematode, with a lifecycle of approximately 3 days), in spite of the fact that the latter does not possess the same cellular diversity as the former. The human genome, in its turn, contains about 30,000 protein-coding genes, only about 1.5 times as many as that of the worm. Clearly then, the complexity of an organism does not depend solely on the genes it contains. The way genes are interconnected and controlled is as important as the genes themselves and their products (Levine and Tjian, 2003).

Unravelling the mechanisms of gene regulation represents one of the most important research goals of the post-genomic era. A first step towards deciphering the gene regulatory logic consists in building a map of the interactions between the genes and their regulators. For the time being, it is enough to know that these regulators are simply proteins that are produced by some genes, and that they can somehow regulate the level of activity of one or more genes, possibly even the genes that produce them (we will discuss the mechanisms of gene regulation in further detail in Section 2.1.2). This static representation of regulatory interactions (and similarly, of many other kinds of biological interactions) takes the form of a biological network (Alm and Arkin, 2003), in this particular case, a transcriptional regulation network (TRN). In a TRN, nodes represent genes and regulatory proteins. An arc gene→regulator means that the gene produces the regulator, whereas an arc regulator→gene states that the regulator controls the gene.
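The arc conventions just described can be illustrated with a minimal sketch. The toy network below is invented for illustration only (the gene and regulator names, and the helper functions `build_trn` and `regulators_of`, are hypothetical and do not come from the thesis); it simply stores both kinds of arcs in one directed adjacency map.

```python
from collections import defaultdict

# Hypothetical toy data: which regulator each gene produces, and which
# genes each regulator controls. All names are illustrative assumptions.
ENCODES = {"geneA": "TF1", "geneB": "TF2"}
REGULATES = {"TF1": {"geneB", "geneC", "geneD"}, "TF2": {"geneA", "geneD"}}


def build_trn(encodes, regulates):
    """Build a directed adjacency map holding both kinds of arcs:
    gene -> regulator (production) and regulator -> gene (control)."""
    arcs = defaultdict(set)
    for gene, regulator in encodes.items():
        arcs[gene].add(regulator)
    for regulator, genes in regulates.items():
        arcs[regulator].update(genes)
    return dict(arcs)


def regulators_of(gene, regulates):
    """Return the set of regulators with an arc regulator -> gene."""
    return {reg for reg, targets in regulates.items() if gene in targets}


trn = build_trn(ENCODES, REGULATES)
print(sorted(trn["TF1"]))                          # genes controlled by TF1
print(sorted(regulators_of("geneD", REGULATES)))   # regulators of geneD
```

In this toy example, geneD is controlled by both regulators, while geneA produces TF1 and is itself controlled by TF2, which is the kind of feedback the text alludes to.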
Constructing such networks implies identifying the genes and their products, as well as the regulators and their protein-DNA interactions. Traditional laboratory techniques for these tasks are costly and time-consuming. Several computational techniques have thus been developed to help carry out the job. Regulation networks, as well as other kinds of networks, biological and non-biological, often present some characteristics that make them more amenable to computational treatment (Alon, 2003). Among these characteristics, we highlight their modularity. From a purely topological point of view, a module is a subset of nodes that are highly interconnected among themselves but loosely coupled to other nodes (see Figure 1.1). From a biological standpoint, a module in a regulation network corresponds to a set of co-regulated genes plus the corresponding regulators. Global regulatory networks are made of a collection of loosely connected and partially overlapping modules. Hence, the identification of regulatory modules is a fundamental step in a bottom-up strategy for the elucidation of the regulatory program of the cell, providing important indications about core biological processes and their main actors.

[Figure 1.1 Schema of a TRN decomposed into modules (delimited by the dotted ovals). Legend: protein P regulates gene G; gene G encodes protein P; gene product TF; TFBS; gene.]

In this thesis, we propose to study the problem of characterising modules of co-regulated genes and their condition-specific regulators through the computational analysis of diverse experimental data concerning multiple species.

Overview of this text

The remainder of this text is organised as follows:

Chapter 2 is conceptually divided into two parts. In the first part, we provide the biological background necessary for understanding the problem of identifying transcriptional regulation modules in general, its relevance, and the context in which it arises. In the second part, we give a general presentation of the problem from a non-specialist perspective, sketching the main objectives we wish to attain.

Chapter 3 contains a literature review of related work on the specific problems we are trying to address, or on subparts of them. We explain those methods in order to highlight their main ideas, with emphasis on their methodological aspects.

Chapter 4 is the main chapter of this thesis and contains a formal description of our proposed methods (models and algorithms) for dealing with the problems introduced in Chapter 2, and the analysis of the experimental results.

Chapter 5 contains an overall discussion of the research work of this thesis and the results obtained, outlining its main contributions. We also present future perspectives concerning this work, indicating a few directions in which it could be improved and/or extended.

We wanted this thesis to be not only a research report but also a record of the studies we had the opportunity to carry out during this doctorate. However, this work lies at the boundary of a few research areas, and even if we consider its position within the (sub)field of Bioinformatics itself, it touches a few distinct important sub-problems, which adds to the difficulty of presenting the material herein in a concise and relatively self-contained manner. Because of our own academic background, and since this thesis is to be presented in a Computer Science department, the adopted tone is more mathematical than biological. This is why we present the basic biological material at the beginning of the text, assuming it may be less familiar, without equivalent “mathematical preliminaries”. Despite this, we do include appendices with background material on assorted specific mathematical topics appearing in the text.

CHAPTER 2. Background and problem description

In this chapter, we introduce the problem we propose to study from a high-level perspective. We neither go into the details of how we model the problem nor delve into specific questions, but we sketch the problem setting and state the overall objectives to be pursued. However, before talking about the research topic itself, we give an introduction to the main biological concepts involved, in the hope that this will help both to motivate the topic and to facilitate its understanding.

2.1. A Biology primer

Biology, the study of life, has accompanied mankind since the earliest days. Genetics is the branch of Biology that concerns the study of heredity and variation in organisms. It has become increasingly popular over the last decades because of its potential applications to medicine and agriculture, for instance, and the ethical, social, economic, and environmental issues they imply. The rapid development of modern Genetics is intimately related to advances in Biochemistry. The union of Genetics and Biochemistry gave rise to a discipline widely known as Molecular Biology, which strives to understand observed biological characteristics and behaviour (phenotype) in the light of their molecular encoding (genotype) and related processes that occur inside cells. A core task performed by cells, implicated in virtually all their vital functions, is that of producing proteins from genes. This process is controlled by sophisticated molecular circuits whose decoding constitutes one of the greatest challenges of the life sciences in our era. In this section, we explain the basics of this process, its main actors, and its control mechanisms.

2.1.1. Molecules of life: nucleic acids and proteins

DNA. Deoxyribonucleic acid, or DNA, is a macromolecule made up of smaller units called nucleotides. Each nucleotide, in its turn, is composed of a pentose sugar molecule (deoxyribose), a phosphate group, and a nitrogenous base that distinguishes one nucleotide from another. The four possible nitrogenous bases are: adenine (a), guanine (g), thymine (t), and cytosine (c). DNA has the form of a double helix, that is, two strands coiled around a central axis (Figure 2.1(a)). Each strand is a chain of nucleotides linked by phosphodiester bonds. The strands are held together by hydrogen bonds between the bases. These bonds follow a strict rule: the purine bases (a and g) of one chain are paired to pyrimidine bases (t and c) on the other chain and vice versa (Figure 2.1(b)), forming the so-called Watson–Crick base pairs (bp). More precisely, an adenine always matches a thymine, and a guanine matches a cytosine (and conversely). As a result of this rule, despite not being identical, the two DNA strands are complementary, in the sense that each determines the other unequivocally. Besides, each chain possesses a relative orientation, given by the numbers of the carbon atoms of the pentose, labelled from 1’ to 5’, through which nucleotides bind to their neighbours in the strand. Positive (or sense) orientation is conventionally assumed to be the 5’-3’ direction, in which a nucleotide is attached to the previous nucleotide through the 5’ carbon, and to the next nucleotide through the 3’ carbon. Symmetrically, the 3’-5’ direction is taken as the negative (or antisense) orientation. The DNA molecule can be represented by a string of characters over the alphabet {a, c, g, t}. This string corresponds to the sequence of nucleotides in one of the strands in the positive orientation.
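The complementarity rule and the string representation described above can be illustrated with a minimal sketch. The names `COMPLEMENT` and `reverse_complement` are our own illustrative choices; the sketch assumes a strand is given as a lowercase string over {a, c, g, t} written in the 5’-3’ direction, and returns the opposite strand, also written 5’-3’.

```python
# Watson-Crick pairing as stated in the text: a <-> t and g <-> c.
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}


def reverse_complement(strand):
    """Given one strand written 5'->3', return the complementary strand,
    also written 5'->3' (hence the reversal of the reading order)."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


strand = "aacg"
other = reverse_complement(strand)
print(other)  # → cgtt
```

Applying the function twice gives back the original string, which reflects the statement that each strand determines the other unequivocally.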
In eukaryotes (higher organisms, mostly multicellular, whose cells possess distinct, membrane-bound nuclei), DNA is bundled into structures known as chromosomes that lie inside the cell nucleus. In prokaryotes (simple unicellular organisms which, in contrast to eukaryotes, do not have organised nuclei), chromosomes consist of single circular molecules of DNA which are found free in the cytoplasm.

[Figure 2.1 The deoxyribonucleic acid (DNA). (a) Schematic three-dimensional conformation of the DNA molecule. (b) Deoxyribose and phosphate molecules from the same strand form a sugar-phosphate backbone to which nitrogenous bases are attached. Bases on one strand bind to complementary counterparts on the other strand through hydrogen bonds: two in the pairing a = t, and three in the pairing g ≡ c. © National Human Genome Research Institute (www.genome.gov).]

Proteins. Proteins are polymers, that is, large chains composed of smaller structural units (monomers) known as amino acids linked by peptide bonds, as shown in Figure 2.2. An amino acid is a molecule composed of a central carbon atom, known as the alpha carbon, or Cα, to which are attached a hydrogen atom (H), an amino group (NH2), a carboxy group (COOH), and a side chain that distinguishes one amino acid from another. There are twenty amino acids commonly found in proteins in nature, roughly represented by capital letters in the range A–Y. Proteins are essential constituents of the cell. They amount to more than half of the dry weight of cells and are involved in almost all cell processes. Some proteins, such as keratin and collagen, known as structural proteins, form most of the solid material of organisms (skin, hair, bones, nails, muscular fibre, etc.). Other proteins are in charge of helping cells to perform their activities: they are called functional proteins. A special group of functional proteins comprises the so-called enzymes, which serve as catalysts of specific chemical reactions that would otherwise take too long to complete, if they started at all. For instance, amylase is an enzyme contained in saliva that helps in the digestion of starch.

[Figure 2.2 Formation of a peptide bond. The carbon of the carboxy group of one amino acid is bound to the nitrogen of the amino group of the other amino acid. During the formation process, a water molecule is released.]

RNA and protein biosynthesis. Ribonucleic acid (RNA) is another form of nucleic acid present in cells. RNA is similar to DNA in the sense that both are composed of nucleotide chains. Major differences from DNA are: (i) RNA is normally a single-stranded molecule, (ii) in RNA nucleotides, the pentose sugar is ribose instead of deoxyribose, (iii) in RNA, the pyrimidine base thymine (t) is replaced by uracil (u), and (iv) RNA varies in both size and structure far more than DNA. There are different kinds of RNA involved in basic cell activities, as we shall shortly see. One of the core tasks performed by cells is that of producing (or synthesising) proteins. Protein synthesis in eukaryotes is more elaborate than in prokaryotes. In both cases, though, it occurs in two phases named transcription and translation, which are illustrated in Figure 2.3 and explained next.

Transcription. At this stage, a contiguous stretch of one strand of the DNA (the coding strand) is transcribed into an RNA molecule known as messenger RNA, or mRNA. Like other biochemical reactions, this polymerisation reaction is controlled by a complex of enzymes, among which an enzyme called RNA polymerase plays the major role. The whole transcription process can be roughly subdivided into three steps: initiation, chain elongation, and termination. The RNA polymerase recognises and strongly binds to a short specific region, known as the promoter, which indicates the start point of the transcription.
It then induces the separation of the two DNA strands by breaking the hydrogen bonds between them. With the strands set apart, it initiates transcription from the transcription start site (TSS), located a few bases downstream¹ of the promoter. The RNA polymerase catalyses the formation of phosphodiester bonds between ribonucleotides using the coding DNA strand as a template; that is, it slides along the template strand and, for each deoxyribonucleotide (c, g, t or a) found on its way, a complementary ribonucleotide (respectively g, c, a or u) is added to the mRNA molecule, which is progressively elongated. RNA chain elongation occurs in the 5’-3’ direction (thus the coding strand is read in the 3’-5’ direction). This procedure eventually halts with the RNA polymerase being released from the DNA molecule, whose double-stranded structure is then restored.

The contiguous stretch of DNA that is transcribed into mRNA, and later translated into a protein, is called a gene. DNA is not always densely occupied by genes. Only 2%–3% of human DNA, for instance, is constituted of genes, the remaining 97%–98% corresponding to intergenic DNA whose function is still largely unknown.

¹ We use the terms upstream and downstream to refer to positions relative to the orientation of the coding strand, with the promoter being upstream from the TSS. It is also a convention to number nucleotides according to their position relative to the first transcribed nucleotide, which we label +1. Nucleotides downstream are numbered +2, +3, +4, . . ., whereas nucleotides upstream are numbered −1, −2, −3, and so on.

Figure 2.3 Protein synthesis. © NHGRI (www.genome.gov).

Translation. Once mature, mRNA leaves the nucleus (in eukaryotes) towards the cytoplasm, where translation into proteins takes place inside cellular organelles called ribosomes. Another kind of RNA, transfer RNA or tRNA, takes part in this process. A tRNA is usually a short RNA molecule (74–93 residues) which possesses a region, called the anticodon, composed of a triplet of nucleotides that base-pairs to the complementary triplet on the mRNA, known as a codon. It also has another region on the opposite side, the acceptor stem, which binds to a specific amino acid. Which particular amino acid is bound to the tRNA is determined by the corresponding codon, as specified in Figure 2.4, the so-called genetic code.

1st base   2nd base                                    3rd base
           u          c          a          g
u          Phe (F)    Ser (S)    Tyr (Y)    Cys (C)    u
           Phe (F)    Ser (S)    Tyr (Y)    Cys (C)    c
           Leu (L)    Ser (S)    STOP       STOP       a
           Leu (L)    Ser (S)    STOP       Trp (W)    g
c          Leu (L)    Pro (P)    His (H)    Arg (R)    u
           Leu (L)    Pro (P)    His (H)    Arg (R)    c
           Leu (L)    Pro (P)    Gln (Q)    Arg (R)    a
           Leu (L)    Pro (P)    Gln (Q)    Arg (R)    g
a          Ile (I)    Thr (T)    Asn (N)    Ser (S)    u
           Ile (I)    Thr (T)    Asn (N)    Ser (S)    c
           Ile (I)    Thr (T)    Lys (K)    Arg (R)    a
           Met (M)    Thr (T)    Lys (K)    Arg (R)    g
g          Val (V)    Ala (A)    Asp (D)    Gly (G)    u
           Val (V)    Ala (A)    Asp (D)    Gly (G)    c
           Val (V)    Ala (A)    Glu (E)    Gly (G)    a
           Val (V)    Ala (A)    Glu (E)    Gly (G)    g

Figure 2.4 The genetic code. Notice that the code is degenerate, since the 64 (4^3) possible codons encode only 20 amino acids.

Translation also has three identifiable steps. At the initiation step, the mRNA start site is recognised by the ribosome and an initiation complex is formed.
During the elongation phase, the mRNA slides through the ribosome, which successively ‘reads’ each codon, recruits an amino acid molecule bound to a tRNA with the corresponding anticodon, and catalyses the formation of a peptide bond between that amino acid and the last one added, forming an elongating chain. Finally, at the termination step, the ribosome recognises the stop codon and interrupts the elongation of the peptide chain, releasing it from the translation complex.

The processes described above, together with the processes of DNA self-duplication and reverse transcription (not described here), constitute the so-called central dogma of molecular biology, depicted in Figure 2.5.

Figure 2.5 The central dogma of molecular biology: DNA is duplicated, DNA is transcribed into RNA (and RNA reverse-transcribed into DNA), and RNA is translated into protein.

2.1.2 Gene expression and regulation

Virtually every cell of an animal or plant contains the same set of chromosomes. It is surprising, therefore, that cells can have so many different forms and perform such diverse functions. What makes pancreas cells produce insulin and not, say, tears? In principle, pancreas cells have the ‘recipe’ for producing tears, or milk, but some circuitry seems to selectively inhibit the action of inappropriate genes. When a protein is produced from a gene, we say that the gene is expressed. Regulation of gene expression corresponds to the intricate logic that controls which genes must be activated, and which must be repressed, under a given condition, such as a particular stage in a cell’s life cycle, or the presence of some external stimulus. It lies at the heart of the processes of cellular differentiation (Levine and Tjian, 2003).

Regulatory mechanisms are diverse and work cooperatively, rather than independently, which makes their study even more complicated. The regulation of gene expression occurs at several levels: transcriptional, post-transcriptional, translational, and post-translational. This classification essentially reflects the moment at which the cellular machinery acts to attenuate or enhance the production of the product of a gene or group of related genes. In this work, we will be mostly concerned with transcriptional regulation, since most cellular pathways are known to be controlled at the transcriptional level. A basic transcription control mechanism consists of the binding of special proteins known as transcription factors (TF) to specific control regions of the DNA known as transcription factor binding sites (TFBS).
TFs can either prevent or foster the initiation of transcription, by hindering the proper coupling of RNA Polymerase to the promoter or by favouring this binding, respectively. The former effect is known as repression, and the latter as activation. The corresponding factors are respectively called repressors and activators.

Most of the well-known regulatory logic in prokaryotes is centred around the concept of an operon, which is basically a group of adjacent structural genes with a common promoter and a regulatory site known as the operator (see Figure 2.6). An operon is a basic transcription unit of prokaryotic cells, that is, its genes are transcribed into a single polycistronic mRNA molecule which encodes the corresponding set of (functionally related) proteins, one for each gene.

Figure 2.6 Layout of a typical prokaryotic operon, comprising a promoter (P), an operator (O) and the structural genes (G1, G2, ..., Gn); the transcription start is labelled +1.

The operator acts as a sort of on/off switch, controlling the expression of the physically linked structural genes in the operon. As explained before, transcription initiation depends on the proper binding of the RNA Polymerase complex to the promoter. The operon is usually organised in such a way that, if a specific regulatory protein is bound to the operator, then it obstructs the linkage of the RNA Polymerase to the promoter, thereby repressing transcriptional activity. The action of the regulatory protein depends upon an effector (or inducer) molecule which binds to it, interfering with its ability to adhere to the operator site. The influence of the effector can either be negative, i.e., impede RNA Polymerase coupling, or positive, i.e., facilitate it.

Gene regulation in eukaryotes is far more involved than in prokaryotes. Even among eukaryotes, the level of complexity and organisation of the regulatory apparatus varies immensely, from a moderately complex mechanism found in unicellular fungi to an utterly elaborate machinery found in metazoans. In any case, though, much of the transcriptional regulatory logic resides in the control of transcription initiation.
As with prokaryotes, the key role in eukaryotic transcription initiation is played by the RNA Polymerase enzyme. Contrary to prokaryotic cells, eukaryotic cells contain three different types of RNA Polymerase (I, II, and III), of which RNA Polymerase II (RNAPolII for short) is the one involved in the transcription of protein-encoding genes.

RNAPolII alone is not capable of firing the transcription process. Instead, it relies on the action of a few proteins known as general transcription factors (GTFs). GTFs help in the correct positioning of the RNAPolII on the promoter, in the separation of the two DNA strands, as well as in the release of the RNAPolII from the promoter as soon as transcription initiates. RNAPolII, together with GTFs and coactivators (see below), forms the preinitiation complex (PIC), which binds to a region called the core promoter, typically spanning nucleotides in the (−50, +50) interval (Figure 2.7).

Figure 2.7 The eukaryotic core promoter is usually composed of four elements: the TATA box (TATA), initiator sequences (INR), the downstream promoter element (DPE), and the TFIIB recognition element (BRE). Typical locations and consensus sequences are displayed: BRE, G/C G/C G/C CGCC, and TATA, TATAAA, both upstream, between about −38 and −26; INR, YYAN T/A YY, around +1; DPE, RG A/T CGTG, around +30. In the consensus sequences, Y represents any pyrimidine (c or t), and R represents any purine (a or g).

Apart from the RNAPolII PIC, in vivo transcription usually depends on other regulatory protein-DNA interactions. In addition to the core promoter, eukaryotic cells often include proximal promoters, usually located a few hundred bases upstream of the TATA box, as well as distal promoters (sub-classified into enhancers and silencers), scattered over distances from tens to hundreds of kilobases both upstream and downstream of the gene they regulate. Enhancers and silencers do not interact directly with the RNAPolII complex, but rely on certain coactivators/mediators to mediate that interaction. To prevent distal promoters from inappropriately affecting genes in the same region of the chromosome, eukaryotic genomes also possess regulatory DNA stretches called insulators, located between enhancers/silencers and the promoters of adjacent genes.
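The consensus notation used for the core promoter elements lends itself to a simple pattern search. The following minimal Python sketch is only an illustration and not part of the methods developed in this work: it expands degenerate nucleotide symbols into regular-expression character classes and scans a sequence for a consensus. The function names and the toy sequence are hypothetical, and the symbols W and S are assumed here as shorthand for the T/A and G/C alternatives of the figure.

```python
import re

# Degenerate nucleotide symbols used in the consensus sequences
# (assumption: W stands for "a or t" and S for "g or c").
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "[CT]",    # any pyrimidine
    "R": "[AG]",    # any purine
    "W": "[AT]",
    "S": "[CG]",
    "N": "[ACGT]",  # any base
}

def consensus_to_regex(consensus):
    """Translate a consensus string into a compiled regular expression."""
    return re.compile("".join(IUPAC[symbol] for symbol in consensus.upper()))

def find_sites(consensus, sequence):
    """Return the 0-based start positions of all (possibly overlapping) matches."""
    pattern = consensus_to_regex(consensus)
    sequence = sequence.upper()
    return [i for i in range(len(sequence)) if pattern.match(sequence, i)]

# Hypothetical toy sequence, for illustration only.
toy = "ggcgcctataaaggctcattctgg"
tata_hits = find_sites("TATAAA", toy)   # TATA box consensus
inr_hits = find_sites("YYANWYY", toy)   # initiator consensus
```

Exact consensus matching of this kind is deliberately naive: real binding sites tolerate mismatches, which is one reason why motif models richer than a plain consensus are needed.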
The main elements of transcriptional regulation in a typical eukaryotic cell presented so far in this section are summarised in Figure 2.8.

Figure 2.8 Schematic view of the several participants of the transcription-initiation process in a typical eukaryotic cell.

2.1.3 Natural selection (in a nutshell)

The foundations of the theory of natural selection were laid out by Charles Darwin in his never-enough-praised The Origin of Species of 1859. It has been the subject of intense criticism but also of incremental refinement since then, and remains to this day the most widely accepted scientific theory for the evolution of living organisms. At the time the theory of natural selection was proposed, the mechanisms of heredity were unknown. Only much later, with the discovery of DNA, could it be established that the phenotype is backed up by a molecular codification, the genotype, and that the mechanisms of heredity act at the level of the genes, the basic hereditary units.

The central idea of the theory of natural selection is that the great diversity in the forms of life results from small random modifications accumulated through generations over the course of millions of years. Natural selection acts over heritable traits (physical characteristics that can be transmitted from parents to offspring), induced by selective pressure due to environmental aspects. It hence presupposes genetic variation. When individuals reproduce, they give rise to offspring with diverse genotypes (and not exact copies) because of mutation and recombination events, and this variation results in different levels of fitness. The fitness of a genotype is usually associated with the mean rate of reproductive success of a group of individuals presenting it. When a new genotype confers higher fitness, it will become more and more frequent over successive generations, being eventually fixed and shared by the entire population. This is known as directional selection. A complementary (and more common) form of selection, called stabilising or purifying selection, occurs when the new genotype results in lower fitness and is thus filtered out over successive generations, becoming less and less frequent until it eventually disappears.

An important phenomenon resulting from natural selection is that of speciation. Two subgroups of individuals of the same population may be formed such that mating between individuals of the same subgroup becomes much more frequent than mating between individuals of different subgroups. This may arise, for instance, because of a geographical separation, or because of sexual selection. In addition, it may happen that natural selection acts differently in the two subgroups, causing incompatible differences to be accumulated, and eventually fixed, separately by each subgroup, up to a point where individuals of different subgroups can no longer interbreed. At this point, these reproductively isolated groups will have constituted two different species.

It has become standard practice to represent the process of evolution of species in the form of a tree, the tree of life, with each node corresponding to a species (or a higher hierarchical unit: genus, family, class...) connected, upwards, to its ancestral species and, downwards, to its descendant species. The leaves of the tree correspond to current species, whereas the root of the tree corresponds to a primitive common ancestral species (see Figure 2.9 for an example).

Figure 2.9 An illustrated partial species tree for the Drosophila genus. Source: DroSpeGe (http://insects.eugenes.org/species/).

2.2 Biological data acquisition

Much of the recent progress in computational Biology can be credited to biotechnological advances which led to the development of techniques allowing for the acquisition of large-scale data concerning different aspects of cellular structure and activity.

2.2.1 DNA sequencing

The process of identifying the exact composition of a DNA molecule, or more precisely, the ordered sequence of nucleotides that constitute a DNA strand, is called DNA sequencing. It lies at the heart of the boom of genomic research that started a few decades ago. Most large-scale sequencing projects are based on the chain termination method designed by F. Sanger in the late seventies, for which he won the Nobel Prize in Chemistry in 1980. The Sanger method exploits the fact that DNA molecules whose lengths differ by a single nucleotide can be separated by gel electrophoresis. In this technique, a pool of differently sized DNA molecules is put at one end of a lane (capillary glass tube) filled with a viscous polymer. A voltage is then applied between the extremities of the lane, causing the molecules to start moving towards the opposite end. Smaller molecules face less resistance as they slide through the gel and thus, after some time, they will have moved further than bigger ones, so that each molecule will be in a distinguishable distance range (band) from the start point, depending on its size. Modern implementations of the Sanger method use a technique known as polymerase chain reaction (PCR), developed by K. Mullis in the early eighties (and for which he won the Nobel Prize in Chemistry in 1993), to obtain samples of differently sized partial copies (or clones) of the DNA to be sequenced.
The clones are produced in such a way that their last nucleotides are special, fluorescently labelled, chain-terminating dideoxynucleotides (ddATP, ddCTP, ddGTP and ddTTP), each receiving a dye of a distinct colour (i.e., they fluoresce at distinct wavelengths). When the buffer containing the labelled clones is submitted to gel electrophoresis, each partial copy will occupy a distinct band: the first band (the one that is most distant from the start point) will contain monomers, the second band will contain dimers, and so on until the last band (the closest to the start point), which will correspond to full copies of the molecule being sequenced. Upon exposure of the dried gel to UV radiation at specific wavelengths, one can identify the bands associated with each terminator or, equivalently, identify the nucleotide at each position of the original molecule. A sketch of the procedure is shown in Figure 2.10.
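The read-out logic of the chain termination method can be illustrated with a small simulation. The Python sketch below is a toy model under idealised assumptions (exactly one labelled fragment per prefix length and error-free separation by size); the function names are hypothetical and the example sequence is arbitrary.

```python
import random

def terminated_fragments(template):
    """Return (length, terminal_base) for every partial copy of `template`.

    Idealised assumption: for each prefix length there is one fragment whose
    labelled terminator identifies its last base.
    """
    return [(length, template[length - 1]) for length in range(1, len(template) + 1)]

def read_gel(fragments):
    """Recover the sequence by reading bands from the shortest fragment to the longest.

    Shorter fragments migrate further, so sorting by length mimics reading
    the gel from the far end back towards the start point.
    """
    return "".join(base for _, base in sorted(fragments))

fragments = terminated_fragments("gattaca")
random.shuffle(fragments)       # the pool is unordered before electrophoresis
recovered = read_gel(fragments)
```

Whatever the initial order of the pool, sorting by fragment length restores the order of the bases, which is all that the electrophoresis step contributes in this simplified picture.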

Figure 2.10 Chain termination sequencing method.

2.2.2 DNA hybridisation arrays

In addition to sequencing technology, another major impulse to genomic research was given by the development of DNA array technology in the 1990s. A DNA array, also called a DNA microarray or DNA chip, is a device used to simultaneously measure the expression level of thousands of genes, usually the whole genome of an organism or a significant portion of it, by gauging the amount of transcribed mRNA for each gene. It basically consists of a solid substrate, usually a glass slide or a nylon membrane, which is divided into a grid of spots (or features), each containing an immobilised sample (from tens to hundreds of thousands of molecules) of a unique nucleic acid molecule with known sequence. The DNA array is designed so that the molecule of each feature, called the probe, has high hybridisation affinity for the mRNA product, called the target, of a specific gene. The pattern of expression of a set of genes is determined by collecting the complete mRNA product of a cell in a certain condition or at a given moment in time, incubating it with the array, and measuring how much of it hybridises with the DNA in each feature.

Gene expression profiling experiments differ depending on the type of array (oligonucleotide, cDNA) and readout technology (confocal laser scanning, phosphorimagers). They are usually performed in series with the objective of monitoring cellular activity under different growth conditions, or cellular response to diverse stress situations or therapeutic treatments. The measurements of the (relative) expression levels of one particular gene under the series of observed conditions constitute its gene expression profile under those conditions. A series of N profiling experiments using DNA arrays representing M genes results in a set of M gene expression profiles, one for each gene, each encompassing N experimental conditions. These profiles (N-dimensional row vectors) are stacked into an M × N matrix called the expression profile matrix. Such a matrix is often represented by a coloured grid of small rectangles, each receiving a colour in a red-green spectrum, with levels of red standing for activation, green for repression, and black for unaltered expression relative to a fixed reference condition (Figure 2.11).

Figure 2.11 Partial view of the expression profile matrix comprising the whole set of S. cerevisiae’s ORFs (only 35 shown) under ∼80 experimental conditions. Source: Eisen et al. (1998, Suppl. website).

2.3 Biological networks

Every biological system, from a single bacterial cell to a complete ecosystem, is composed of several elements that act cooperatively in a complex and coordinated way to perform the tasks necessary to the maintenance of life. A great deal of traditional research in the life sciences consists of isolating and dissecting the structural properties of individual components of such systems. Whereas this represents a legitimate and, moreover, necessary investigation practice, it fails to completely explain the dynamical and evolutionary behaviour of biological systems. It was soon realised that a holistic understanding could only be attained by studying the interactions between the parts of those complex systems in parallel with their individual properties.

Over the last decades, it has become popular to model biological interactions at a static level as networks. A biological network (Alm and Arkin, 2003) is a network, i.e. a set of interconnected nodes, in which each node represents a component of a biological system and a connection between two nodes represents some kind of interaction or functional association between them. Although simple, this modelling allows for the study of several aspects of the cellular machinery at a molecular level. Relevant questions concern the size, topology and evolution of such networks. In the long run, we would like to be able to construct dynamical models reliable enough to permit the explanation of manifested behaviour (phenotype) from its biomolecular codification (genotype). Moreover, we would like to build predictive models that let us perform simulations in order to anticipate future states of the cell.

Notable progress has been observed in the study of three kinds of molecular interaction networks: protein-protein interaction networks, metabolic networks, and transcriptional regulation networks.

In a protein-protein interaction network, nodes represent proteins and edges represent physical interaction (which usually implies functional association) between the connected proteins. Traditional methodologies used for determining protein interactions include protein co-immunoprecipitation, and interaction mapping through yeast-two-hybrid (Y2H) screens.

A metabolic network represents the chains of chemical reactions underlying the metabolism of an organism. In a metabolic network, nodes correspond to compounds, whereas a directed connection labelled E from node A to node B (A −E→ B) indicates that substrate A is used by enzyme E to produce metabolite B. Networks of this kind often arise from innumerable studies targeted at individual enzymes and substances.

In this work, we will be mainly concerned with the last of the three kinds of biological network mentioned above. In a transcriptional regulation network (TRN) each node represents a protein or a gene. An arc (directed connection) protein → gene indicates that the protein regulates the gene, whereas an arc in the opposite direction means that the gene encodes the protein.
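As an illustration of this definition, the following Python sketch stores a TRN as a directed graph with the two kinds of arcs kept in separate adjacency maps. It is a minimal sketch rather than a data structure used in this work; the class, its method names and the gene and protein identifiers are all hypothetical.

```python
from collections import defaultdict

class TranscriptionalRegulationNetwork:
    """Toy directed graph with two kinds of arcs: protein -> gene and gene -> protein."""

    def __init__(self):
        self.regulates = defaultdict(set)  # protein -> genes it regulates
        self.encodes = defaultdict(set)    # gene -> proteins it encodes

    def add_regulation(self, protein, gene):
        self.regulates[protein].add(gene)

    def add_encoding(self, gene, protein):
        self.encodes[gene].add(protein)

    def regulators_of(self, gene):
        """Proteins with an arc into `gene`."""
        return {p for p, genes in self.regulates.items() if gene in genes}

    def downstream_genes(self, gene):
        """Genes regulated by the products of `gene` (one regulatory step away)."""
        targets = set()
        for protein in self.encodes.get(gene, ()):
            targets |= self.regulates.get(protein, set())
        return targets

# Hypothetical identifiers, for illustration only.
trn = TranscriptionalRegulationNetwork()
trn.add_encoding("geneA", "TF_A")
trn.add_regulation("TF_A", "geneB")
trn.add_regulation("TF_A", "geneC")
trn.add_encoding("geneB", "TF_B")
trn.add_regulation("TF_B", "geneC")
```

Keeping the two arc types apart makes it easy to follow a gene to its product and from there to the product's targets, which is the elementary step in tracing a regulatory cascade.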
Thus the assembly of transcription networks depends on the identification of the genes of a genome and their respective products, as well as the localisation of protein-DNA interactions. Gene identification (also called gene finding or prediction) is a well-studied problem in computational Biology (Mathé et al., 2002). Current methods are mainly based on homology, i.e., on the sequence similarity with previously characterised genes, on the analysis of the sequence content (nucleotide composition, codon usage bias, hexamer frequency, etc.), and on the recognition of certain functional signals specific to genes in contrast to intergenic, non-protein-coding regions. As for protein-DNA interaction identification, traditional methods include chromatin immunoprecipitation (ChIP) assays, in which a protein is cross-linked in vivo to DNA, which is then sheared, and the resulting fragments are bound to a protein-specific antibody, causing the precipitation of the complex formed by the protein, the antibody and the DNA stretch. After isolation of the protein-bound pieces, the link is reversed and the protein digested, resulting in DNA which is then purified, amplified, and finally sequenced. A recent technique based on the use of DNA arrays to perform several thousand simultaneous ChIP experiments (ChIP-on-Chip) (Ren et al., 2000) has made possible the construction of fairly comprehensive TRNs of model organisms, like the budding yeast (Lee et al., 2002).

Biological networks have been found to exhibit some characteristics that confer on them a degree of regularity and of compliance with ‘good design’ principles beyond what would be expected for random networks (Alon, 2003). Although unexpected, those fortuitous characteristics make those networks, otherwise arbitrarily large and complex, amenable to computational study. In particular, we highlight some features also found in other kinds of networks, such as social networks, or even human-engineered networks, such as computer networks. They are: scale-free connectivity distribution, the use of recurrent topological patterns, and modularity.

In scale-free networks (Barabási and Albert, 1999), the probability of finding a k-connected node, P(k), follows the power-law distribution P(k) ∝ k^(−γ), for some constant γ. In practical terms, this means that the networks have only a few highly connected nodes, or hubs. This is a common architectural feature of computer networks, for instance. Also, in a scale-free network the distance between any two nodes is typically shorter than in a random network. This small-world property is also often found in social networks.

A common engineering practice is to arrange the simplest components of a system into a restricted set of small basic circuits, and then to re-use these patterns recurrently as building blocks for the construction of even more elaborate subsystems and so forth, leading up to the whole system.
For example, in electronics, transistors and resistors are organised into logic gates, and these into flip-flops, full-adders, multipliers, and so on. Once a whole range of basic components is available, they are connected in a convenient way to build up the desired system. Similarly, biological networks have been found to possess some small recurrent topological patterns, or network motifs (Milo et al., 2002), that often correspond to specific information-processing tasks such as filtering input fluctuations, or accelerating network throughput (Shen-Orr et al., 2002).

In a modular architecture, the nodes of a network can be decomposed into a set of (possibly overlapping) groups, or modules. Although modularity is a somewhat fuzzy concept, a module in a network can be loosely defined as a group of nodes highly interconnected among themselves and weakly connected to other nodes outside the module. The idea is that modules would correspond to semi-autonomous functional or structural units of the whole system with well-defined interfaces, i.e. a set of input and output nodes that serve to connect it to other modules. Modular networks can be more easily adapted and reconfigured, since intra- or intermodular adjustments are more isolated and potentially affect far fewer nodes than in a highly connected non-modular network. Therefore, it seems reasonable that essentially dynamic, often-evolving biological systems display a modular architecture. Modular networks can also be more easily treated by computer algorithms. Indeed, instead of considering the network as a whole, it is common practice to study it from the ground up, starting with the modules, one at a time, and then analysing the connectivity between the modules to get a global picture.

2.4 Identifying transcriptional regulation modules: problem overview

Even though no precise and universal definition exists, a regulatory module can be thought of as a group of genes that are co-regulated under certain conditions. Here, by ‘co-regulated genes’ we understand genes co-regulated at the transcriptional level, i.e., genes that are controlled by common transcription factors and which act coordinately as a functional unit in response to particular growth conditions or environmental stimuli. Therefore, it is more precise to say that we are interested in identifying transcriptional regulation modules (TRM).

The task of identifying TRMs in silico depends heavily on the particular definition of modularity and the availability of meaningful data with respect to the assumed definition. With full genome sequence databases being populated at a relentless pace, the identification and annotation of all functional elements in the genome, not only genes but also regulatory sequences, became an important issue.
This need gave rise to one of the most important problems in computational Biology, vaguely entitled ‘motif finding’, which appears in the literature under many denominations, for instance, with the word ‘motif’ being replaced by ‘pattern’, ‘signal’ or ‘element’, and the term ‘finding’ being replaced by ‘inference’, ‘detection’ or ‘discovery’. The term ‘motif’ itself is subject to various interpretations. So well studied is the problem that even surveys on the subject abound (e.g., Brazma et al. (1998); Vanet et al. (1999); Stormo (2000); Rigoutsos et al. (2000); Zhang (2002); GuhaThakurta (2006), to cite a few). In general, motifs are representations of (usually short) recurrent signals in sets of biological sequences which can represent, for instance, binding sites for regulatory proteins.

We can exploit the fact that genes that are co-regulated by common TFs share motifs representing their binding sites. In this case, we define a regulatory module as a group of genes sharing motifs, and therefore the problem of identifying TRMs depends basically on the task of identifying common cis-regulatory elements² in the regulatory regions of genes.

The advent of DNA microarray technology and the availability of gene expression data opened new perspectives for the development of computational methods aimed at identifying regulatory modules. Knowing that co-regulated genes behave in a coordinated manner, we could define a TRM as a group of genes with coherent (transcriptional) expression profiles across a certain set of conditions. Considering the measured expression profiles of the genes as multidimensional vectors, the task of identifying TRMs would correspond to finding groups of vectors whose projections onto a certain subset of features (i.e. specific vector positions) display a certain level of ‘similarity’, according to some definition. This problem then relates to the problem of multidimensional data clustering or, more specifically, biclustering, which will be more thoroughly discussed in Section 3.2.2.

Both in the motif-oriented and in the cluster-oriented versions, the problem of identifying TRMs implies designing efficient models, data structures and algorithms for the identification of patterns in voluminous data sets. Two main problems intervene in this task, one intrinsic and the other of a more practical nature. The first problem is that the biological signals are often very subtle, that is, they correspond to rather small chunks of information buried in massive amounts of non-signal, background data. The second difficulty is that current data sets are frequently incomplete and noisy, sometimes so noisy as to mask real signals and induce false ones. Therefore, apart from the natural subtlety and variability of biological patterns, the methods developed also have to cope with additional sources of uncertainty, which renders their design particularly problematic.
We are thus asked to put forth strategies for amplifying the signals in order to help them stand out from noise and become more amenable to computational retrieval. A promising strategy is to base the concept of modularity on multiple complementary criteria, backed up by several kinds of data. In our problem setting, we concentrate on the analysis of gene expression and sequence data. As already mentioned, by studying gene expression data, one tries to identify genes that display similar expression patterns across certain subsets of experimental conditions, and then use this to infer co-regulation. Similarly, relying on the hypothesis that co-regulated genes are bound by common TFs, if we can identify motifs shared by some genes, we can use this as evidence of co-regulation. Now, if a given set of genes is found to display coherent expression patterns and, in addition, these genes also share cis-regulatory elements, then chances are improved that they indeed form part of a regulatory module.

² It is common to refer to both TFs and their respective binding sites indistinguishably as regulatory elements. When a clearer distinction is necessary, we call them respectively trans-regulatory elements and cis-regulatory elements.
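The idea of combining the two kinds of evidence can be made concrete with a small Python sketch. It is only an illustration under stated assumptions and not the method developed in this work: expression coherence is taken to be a pairwise Pearson correlation above a hypothetical threshold `min_corr`, motif occurrences are assumed to be given as a set of motif identifiers per gene, and all names and toy data are invented.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coherent_expression(profiles, genes, conditions, min_corr=0.8):
    """True if every pair of genes is correlated over the chosen conditions."""
    restricted = {g: [profiles[g][c] for c in conditions] for g in genes}
    return all(pearson(restricted[a], restricted[b]) >= min_corr
               for a, b in combinations(genes, 2))

def shared_motifs(motifs_by_gene, genes):
    """Motifs found in the regulatory region of every gene in the group."""
    return set.intersection(*(set(motifs_by_gene[g]) for g in genes))

def supports_module(profiles, motifs_by_gene, genes, conditions, min_corr=0.8):
    """Both criteria must hold: coherent expression and at least one shared motif."""
    return (coherent_expression(profiles, genes, conditions, min_corr)
            and bool(shared_motifs(motifs_by_gene, genes)))

# Hypothetical toy data: expression values per gene over four conditions.
profiles = {
    "g1": [1.0, 2.0, 3.0, 0.5],
    "g2": [2.1, 3.9, 6.2, 4.0],
    "g3": [3.0, 1.0, 0.2, 2.0],
}
motifs_by_gene = {"g1": {"m1", "m2"}, "g2": {"m1"}, "g3": {"m1", "m3"}}
```

In this toy data, g1 and g2 are coherent over the first three conditions but not over all four, which illustrates why the subset of conditions is part of what has to be found, not just the subset of genes.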

(45) 2.4 IDENTIFYING TRANSCRIPTIONAL REGULATION MODULES: PROBLEM OVERVIEW. 23.

A distinguishing characteristic of our approach to the study of transcriptional regulation modularity is that we also take into account the evolutionary conservation of modules. We explore the biological hypothesis that relevant functional associations are more subject to selective pressure and thus more likely to be conserved during evolution. We therefore contend that, if a module is found to be reasonably conserved in multiple species, it can be assigned a higher significance, since the odds that it would arise by chance from the data sets of multiple species simultaneously are appreciably reduced. In this case, instead of looking for modules in data sets concerning a single species, it is desirable to search data sets concerning multiple related species. In light of the preceding discussion, we formulate the definition of a conserved transcriptional regulation module as follows.

Definition 2.1 (Transcriptional regulation metamodule, TRMM). Let $\mathcal{S} = \{1, \ldots, S\}$ be a set of related species. For each species $k = 1, \ldots, S$, let $g_k$ represent its set of genes, and $c_k$ be a set of conditions of interest w.r.t. the transcriptional activity of the genes in $g_k$. A ‘conserved’ transcriptional regulation module, which we call a transcriptional regulation metamodule, takes the form

$$M = \big((\tilde{r}_1, \tilde{g}_1, \tilde{c}_1), \ldots, (\tilde{r}_S, \tilde{g}_S, \tilde{c}_S)\big)$$

where, for each $k$, $\tilde{g}_k \subset g_k$ is a group of co-regulated genes, $\tilde{c}_k \subset c_k$ is the group of conditions under which the genes in $\tilde{g}_k$ are co-expressed, and $\tilde{r}_k$ is a motif profile, i.e., a set of cis-regulatory elements shared by the genes in $\tilde{g}_k$.

The first thing that needs to be clarified about the definition above is that by ‘related species’ we mean ‘evolutionarily related’. More precisely, we assume that the species in $\mathcal{S}$ descend from a common ancestor and that the phylogenetic tree relating these species is known.
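To make the shape of Definition 2.1 concrete, the following sketch represents a metamodule as a tuple with one (motifs, genes, conditions) triple per species and checks the two subset constraints. It is only one possible encoding, not the data structure used in this thesis: the names `Module` and `is_valid_metamodule` are hypothetical, motifs are represented as plain strings, and the inclusion is read as "subset or equal", which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    """One metamodule component (r~_k, g~_k, c~_k) for a single species k."""
    motifs: frozenset      # r~_k: shared cis-regulatory elements (motif profile)
    genes: frozenset       # g~_k: co-regulated genes
    conditions: frozenset  # c~_k: conditions under which they are co-expressed


def is_valid_metamodule(metamodule, genes_by_species, conditions_by_species):
    """Check that g~_k is within g_k and c~_k is within c_k for every species k.

    `metamodule` is a sequence with one Module per species, aligned with the
    per-species gene sets and condition sets.
    """
    if not (len(metamodule) == len(genes_by_species) == len(conditions_by_species)):
        return False
    return all(
        m.genes <= g and m.conditions <= c
        for m, g, c in zip(metamodule, genes_by_species, conditions_by_species)
    )


# Toy data (hypothetical) for S = 2 species.
genes_by_species = [{"a1", "a2", "a3"}, {"b1", "b2"}]
conditions_by_species = [{"heat", "cold"}, {"heat", "starvation"}]
metamodule = (
    Module(frozenset({"TGACGT"}), frozenset({"a1", "a2"}), frozenset({"heat"})),
    Module(frozenset({"TGACGT"}), frozenset({"b1"}), frozenset({"heat"})),
)
valid = is_valid_metamodule(metamodule, genes_by_species, conditions_by_species)
```

Note that this check covers only the set-membership part of the definition; the cohesion and conservation requirements discussed in the text are not modelled here.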
A phylogenetic tree is a graphical description of the evolutionary relationships among a set of species in which every node corresponds to a species, each internal node is the immediate common ancestor of its children, and the edge lengths represent estimates of evolutionary distance in terms of time. Knowing the evolutionary relationships among species helps to distinguish irrelevant similarities, which occur fortuitously merely because the organisms are closely related, from relevant similarities, which are maintained by selective pressure over longer evolutionary distances.

We notice that a transcriptional regulation metamodule (or metamodule, for short) is essentially a set of conserved modules from several species. In order to give a sharper idea of what we consider to be a module, let us take a given species $k$ and concentrate on the corresponding metamodule component $M_k = (\tilde{r}_k, \tilde{g}_k, \tilde{c}_k)$. In fact, $M_k$ describes a module composed of a group of co-regulated genes, $\tilde{g}_k$, the conditions under which the co-regulation is manifested, $\tilde{c}_k$, and the corresponding cis-regulatory elements, $\tilde{r}_k$. According to our concept of a TRM, specific relations must hold between the components of $M_k$. First, we require the motifs in $\tilde{r}_k$ to be over-represented in the regulatory regions of the genes in $\tilde{g}_k$. Second, the genes in $\tilde{g}_k$ should have similar transcriptional expression profiles under the conditions $\tilde{c}_k$. Although we do not yet define exactly how to measure over-representation or expression profile similarity, we note that this information is contained in sequence and gene expression data respectively, and that the intensity of these ‘forces’ holding together the components of $M_k$ is used to measure the cohesion of the module.

Key to our definition of a metamodule is the fact that, apart from the intramodule relationships described above, we require that the individual modules composing the TRMM also display a level of conservation. However, if the very concept of a module is ill-defined, there is even less agreement on what is meant by a conserved regulatory module. To start explaining our notion of module conservation, let us recall that two genes from distinct species are said to be orthologues if they evolved from a common ancestral gene following a speciation event. A relevant question is to what extent orthologues preserve function. As an indication, some studies show that, to some extent, orthologous genes conserve co-regulation over significant phylogenetic distances (Snel et al., 2004; Okuda et al., 2005). We postulate that the ‘important’ co-regulation relations are evolutionarily conserved and use this property as the core of our definition of module conservation. More precisely, given any two species $k$ and $s$, whose metamodule components are respectively $M_k = (\tilde{r}_k, \tilde{g}_k, \tilde{c}_k)$ and $M_s = (\tilde{r}_s, \tilde{g}_s, \tilde{c}_s)$, we require that a good matching between the gene sets $\tilde{g}_k$ and $\tilde{g}_s$ w.r.t. orthology can be established.
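The requirement of a ‘good matching’ between two gene sets with respect to orthology is not quantified at this point of the text. Purely as an illustration, and not as the measure adopted in this thesis, one could score it as the fraction of genes in either set that have at least one orthologue in the other set. The function name `orthology_match` and the toy orthologue pairs are hypothetical.

```python
def orthology_match(genes_k, genes_s, orthologues):
    """Fraction of genes in both sets that have an orthologue in the other set.

    `orthologues` is an iterable of (gene_in_species_k, gene_in_species_s)
    pairs, e.g. inferred beforehand from sequence homology.
    Returns a value in [0, 1]; 1 means every gene is matched.
    """
    pairs = {(a, b) for a, b in orthologues if a in genes_k and b in genes_s}
    matched_k = {a for a, _ in pairs}
    matched_s = {b for _, b in pairs}
    total = len(genes_k) + len(genes_s)
    if total == 0:
        return 0.0
    return (len(matched_k) + len(matched_s)) / total


# Toy data (hypothetical): a3 has no orthologue in the other module,
# and the pair involving a4 is ignored because a4 is not in the module.
genes_k = {"a1", "a2", "a3"}
genes_s = {"b1", "b2"}
orthologues = {("a1", "b1"), ("a2", "b2"), ("a4", "b2")}
score = orthology_match(genes_k, genes_s, orthologues)
```

A score of this kind ignores how strong each orthology relation is; a weighted matching would be needed to take sequence similarity levels into account.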
In other words, we require that the co-regulated genes chosen to form individual modules be those whose co-regulation is preserved by evolution. Knowing that orthologues often display a high degree of sequence similarity, we follow common practice and infer orthology relations from sequence homology. In principle, this can be misleading, since duplicated genes (or paralogues) also present a high level of sequence identity. We expect, however, that intramodule co-regulation relationships and intermodular gene set orthology will complement each other and help to identify true associations. We expect not only that distant orthologues which have diverged in sequence but remain functionally equivalent can be identified, but also that paralogues which have diverged in function will be filtered out for failing to meet the co-expression requirements, even though they have sequence homologues in the modules of other species.

As mentioned, biological sequence motifs are hard to identify since they are usually small and degenerate. Fortunately, here too we can draw on the fact that important functional sequences are more subject to selective pressure and therefore evolve more slowly. This is the principle of the method known as phylogenetic footprinting (discussed.
