Statistical analysis of evolution by genome rearrangements = Análise estatística da evolução por rearranjo de genomas


INSTITUTO DE COMPUTAÇÃO

Priscila do Nascimento Biller

Statistical analysis of evolution by genome rearrangements

CAMPINAS

2016


Dissertation presented to the Institute of Computing of the University of Campinas in partial fulfillment of the requirements for the degree of Doctor in Computer Science.

Supervisor: Prof. Dr. João Meidanis

This copy corresponds to the final version of the thesis defended by Priscila do Nascimento Biller and supervised by Prof. Dr. João Meidanis.


Cataloging record

Universidade Estadual de Campinas

Library of the Instituto de Matemática, Estatística e Computação Científica

Maria Fabiana Bezerra Muller - CRB 8/6162

Biller, Priscila do Nascimento,

B494s Statistical analysis of evolution by genome rearrangements / Priscila do Nascimento Biller. – Campinas, SP : [s.n.], 2016.

Supervisor: João Meidanis.

Thesis (doctorate) – Universidade Estadual de Campinas, Instituto de Computação.

1. Computational biology. 2. Phylogeny - Mathematics. 3. Method of moments (Statistics). 4. Graph theory. 5. Comparative genomics. 6. Molecular evolution - Computer simulation. I. Meidanis, João, 1960-. II. Universidade Estadual de Campinas. Instituto de Computação. III. Title.

Information for the Digital Library

Title in another language: Análise estatística da evolução por rearranjo de genomas

Keywords in English: Computational biology; Phylogeny - Mathematics; Method of moments (Statistics); Graph theory; Comparative genomics; Molecular evolution - Computer simulation

Area of concentration: Computer Science

Degree: Doctor in Computer Science

Examination committee: Cid Carvalho de Souza; João Carlos Setubal; Sergio Russo Matioli; Zanoni Dias; Fábio Luiz Usberti

Defense date: 15-08-2016

Graduate program: Computer Science


Examination committee:

• Prof. Dr. Cid Carvalho de Souza, Instituto de Computação, UNICAMP

• Prof. Dr. João Carlos Setubal, Instituto de Química, USP

• Prof. Dr. Sergio Russo Matioli, Instituto de Biociências, USP

• Prof. Dr. Zanoni Dias, Instituto de Computação, UNICAMP

• Prof. Dr. Fábio Luiz Usberti, Instituto de Computação, UNICAMP

The minutes of the defense, with the signatures of the committee members, are on file in the student's academic record.


In the same way as statistical methods have a model to guide them, I am also surrounded by people who inspire my life and work. My family, Rodolfo, Sandra, and Rodolfo Rodrigo, has been my foundation and inspiration for everything, and I dedicate this work to them. During the evolution of this thesis, all types of moments have happened, some joyful, others harsh. Looking back over them, it would have been impossible to keep my state of equilibrium without the great support from my friends: Vinicius de Novaes, Mário Felice, Mariana Felice, Javiera Gonzalez, Ramom Santana, Alex Oliveira, Juliana Destro, Gustavo Waku, Sheila Cáceres, Esteban Rodriguez, Aloísio Vilas-Bôas, and all the people from LOCo. Thanks for the friendship over the years.

Among these moments, the one year I stayed abroad provoked large-scale rearrangements in many aspects of my life. Thanks to Jules Lallouette, Ilya Prokin, Marine Jacquier, Charles Rocabert, Sergio Peignier, Thiago Kanthack, Marcelo Sandrini, Duarte Gouveia, and all the people from the Beagle and Dracula teams, I felt at home during my stay, and it will surely be a great pleasure and excitement to share new moments with you.

Many agents were responsible for the development and evolution of this thesis. In Campinas, my advisor João Meidanis, João Paulo Zanetti, and my colleagues from LOCo; in Lyon, Eric Tannier, Laurent Guéguen, and the team from LBBE, as well as Carole Knibbe, Guillaume Beslon, and all professors and students from the Beagle team at INRIA. I thank you all for the collaboration on papers, the engaging chats, and the exchange of knowledge. I would like to especially thank Eric Tannier for supervising my internship abroad, for providing the ideal environment for this work to evolve further, with in-depth discussions about the problems and methods, and for the openness and patience to discuss a topic more than once until I had digested it.

Last but not least, I want to express my gratitude to my advisor João Meidanis for the guidance and honest advice during all these years; the staff of the Institute of Computing, especially the secretariat, for all the support over the last decade; the University of Campinas for its excellent infrastructure; INRIA for hosting me during the internship; and FAPESP for the financial support through grants 2012/14104-0 and 2013/25084-2.




The comparative method in evolutionary biology consists in detecting similarities and differences between extant organisms and, based on more or less formalized hypotheses on the evolutionary processes, inferring ancestral states that explain the similarities and an evolutionary history that explains the differences. A classical problem in comparative genomics is to compare two genomes and estimate the amount of evolutionary change that has occurred in the lineages separating them. Evolutionary changes in genomes can happen at different scales, from single nucleotide mutations to large chromosomal rearrangements.

In this thesis we present new models of evolution by rearrangements, and statistical estimations based upon them. We first propose an exact, closed, analytically invertible formula for the expected number of breakpoints after a given number of Double-Cut-and-Join (DCJ) operations. This improves on the previously proposed formula, which is heuristic, recursive, and slower to compute. We then establish formal links between genome evolution by DCJ and three well-known processes (binary sequences under substitutions, permutations under transpositions, and random graphs), and as a consequence put the intuitions of former studies on a theoretical footing or correct them.

In order to validate the ability to estimate the number of rearrangements in biological data and to produce benchmarks for rearrangement studies, we used Aevol, an in silico experimental evolution platform designed to understand processes of genome structural evolution. We tested several estimates based on traditional models of evolution by inversions, and showed that most combinatorial and statistical estimators, which behaved perfectly on ad-hoc simulations, failed on this dataset. Ad-hoc simulations very often encode the same simplifications and assumptions as the inference methods. Artificial life systems and in silico models of genome evolution, however, are independent and based on more sophisticated biological principles than most ad-hoc simulators. As a consequence, we argue that the data they produce is probably closer to actual biological data.

We then provide an in-depth examination of the flaws that we identified in the analyzed models. These flaws fall into two categories: one is to ignore the heterogeneity of susceptibility to breakage across genomic regions, and the other is to suppose that the number of susceptible regions is given. We then propose a model of evolution by inversions where breakage probabilities vary across regions and over time. It subsumes as a particular case the uniform breakage model on the nucleotide sequence, in which breakage probabilities are proportional to fragile region lengths. In this particular case, the equilibrium distribution in the model resembles the distribution of intergene sizes from diverse organisms. This model is very different from the frequently used model in which all fragile regions have the same probability to break. Estimates based on our model performed significantly better on simulated data, and gave the most plausible results on pairs of amniote genomes when the number of susceptible regions was co-estimated.

List of Figures

2.1 Genome
2.2 Genome rearrangements
2.3 Double-Cut-and-Join (DCJ) operation
2.4 Markovian model of genome evolution
2.5 Method of moment estimator for genome evolution
3.1 Two genomes, with their real and observed breakpoint graphs
3.2 Evolution of genomes, sequences, permutations or graphs
3.3 4×4 correction when estimating a distance through binary sequence evolution
3.4 Comparisons of some DCJ estimators on a simulation from the DCJ model
4.1 Overview of the Aevol model
4.2 The results of 7 estimators: ID, Badger, EH, gDCJ, ER1, ER2, AA
5.1 Transformation of a genome into a permutation of genes
5.2 Transition probabilities of INFER
5.3 Density of intergene sizes from INFER and biological data
5.4 Behavior of rearrangement distance estimators
5.5 Difference between the estimation of the genomic distance and the inversion distance on amniote genomes
5.6 Difference between the estimation of the genomic distance and the inversion distance on amniote genomes, considering chromatin regions
5.7 Difference between the estimation of the genomic distance and the inversion distance on yeast genomes
5.8 Simulation on multichromosomal genomes
5.9 The effect of removing lonely genes
5.10 Estimations of the number of fragile regions
5.11 Number of fragile regions estimated by ER2

Contents

1 Introduction
2 Background
2.1 Genome evolution by rearrangements
2.1.1 Double-Cut-and-Join (DCJ) and other k-break operations
2.2 Measuring observable differences: Breakpoint graph
2.3 Modelling genome evolution with Markovian models
2.3.1 Evolution in molecular sequences
2.3.2 Evolution in the symmetric group
2.3.3 Evolution in random graphs
2.4 Method of moment estimators for genome evolution
2.4.1 Observable parameter: Breakpoints and cycles
2.4.2 Estimation quality: Exact, approximated, and heuristic solutions
2.4.3 Running time: Recursive, numerically invertible, and analytically invertible formulas
2.5 Alternative ways to estimate evolutionary distance
2.5.1 Estimates for genomes with duplications
2.5.2 Bayesian estimates
2.5.3 Parsimony-based estimates
2.6 Related work in a nutshell
3 Moments of genome evolution by Double-Cut-and-Join
3.1 New model of evolution by DCJs
3.1.1 Genomes and DCJ
3.2 Closed formula for the expected number of breakpoints
3.3 Formalizing links with other models
3.3.1 A link with sequence evolution
3.3.2 A link with transpositions in the symmetric group
3.3.3 A link with random graphs
3.4 Empirical comparisons
3.5 Discussion
4 Benchmarking evolutionary studies with artificial life
4.1 Validation of evolutionary inferences
4.2 Comparative genomics: Estimating an inversion distance
4.3 Artificial life: In silico experimental evolution and the Aevol platform
4.4 Inversion distance estimators on artificial genomes

5.1 The pseudo-uniform model and its pitfalls
5.2 The evolutionary model INFER and its stationary distribution
5.2.1 Genomes and inversions
5.2.2 Transition Probabilities
5.2.3 Equilibrium distribution
5.2.4 The uniform model, a particular case of INFER
5.3 Distance estimators for simulated genomes with known fragile regions
5.3.1 The behavior of pseudo-uniform-based distance estimators
5.3.2 An INFER-based distance estimator: ER1
5.4 A distance estimator for real genomes, with unknown fragile regions
5.4.1 Why the number of solid and fragile regions is unknown
5.4.2 Co-estimating the distance and the number of fragile regions: ER2
5.5 Discussion
5.5.1 Slow and fast evolving sites
5.5.2 What is a uniform model of genomic breakage?
5.5.3 Towards a general model for genome rearrangements
5.5.4 Limits
6 Conclusions
6.1 Future work


Introduction

The comparative method in evolutionary biology consists in detecting similarities and differences between extant organisms and, based on more or less formalized hypotheses on the evolutionary processes, inferring ancestral states that explain the similarities and an evolutionary history that explains the differences.

A classical problem in comparative genomics is to estimate the amount of evolutionary change between two genomes. Evolutionary changes in genomes can happen at different scales, from single nucleotide mutations to large chromosomal rearrangements.

Genome rearrangements gave rise to one of the first computational biology problems, long before the structure of DNA was known and exploited as a document for evolutionary history [83]. Despite this precedence, genome rearrangements are still much more difficult to exploit in molecular evolutionary studies than DNA or protein sequences, mainly because of the mathematical and computational complexity of the models.

Various combinatorial [41] and statistical [37] methods have been developed to infer the number of rearrangements separating two genomes. Indeed, comparing the organization of two genomes under a parsimony assumption already gives rise to a whole field of combinatorics (see [41] for a survey on it).

In this thesis we studied probabilistic models and statistical estimators, which can provide much more accurate solutions than combinatorial approaches and sometimes bypass their computational complexity. In Chapter 2 we give an overview of statistical methods and their relations with various known Markov processes, such as random graphs, transpositions in the symmetric group, and coagulation-fragmentation. We formalized the relations between genome evolution by inversions (rearrangements that reverse the reading direction of a genomic segment) and these three well-known processes and, as a consequence, gave a theoretical foundation to the intuitions of former studies or corrected them.

In an attempt to simplify genome rearrangement studies, a more general model was proposed a decade ago [96], encompassing inversions and other biological rearrangements, such as transpositions and translocations. This model is called Double-Cut-and-Join (DCJ) and it became very popular among combinatorial studies [41]. We showed here how DCJ can also significantly enrich and simplify statistical methods of moment estimations. We introduced a “mechanistic” DCJ model, focused on breakage probabilities rather than on events, which allows one to obtain a closed, analytically invertible, exact formula for the expected number of breakpoints after a fixed number of DCJs. This formula greatly improves on the previously published estimation [58], which was based on an unbounded approximation, computed by a recurrence, and thus not easily invertible. Chapter 3 is devoted to all the results mentioned so far.

In the second part of our study, we dealt with a common concern in all evolutionary studies: the validity of the methods and results. For rearrangement studies, validation is still an issue, as we have almost no direct access to ancient molecules, and simulations are often based on the same assumptions as the methods, leading to easy, unrealistic instances. The standard and most used model depicts genomes as permutations of genes and assumes that an inversion reverses a segment of the permutation, taken uniformly at random over all segments. When simulators are designed to validate the estimators, they also use permutations as models of gene orders, and inversions on segments of these permutations, chosen uniformly at random. Estimators perform well on such simulations, but transforming a genome into a permutation of genes is such a strong simplification, shared by both the estimators and the simulators, that this performance says nothing about the ability to estimate a rearrangement distance in biological data. In particular, it is unrealistic, unlikely, and unstable to assume that fragile regions all have the same size and keep the same size throughout a rearrangement scenario.
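The standard simulation scheme described above can be sketched as follows. This is a minimal illustration, not the code used in the cited studies: the function names (`random_inversion`, `count_breakpoints`, `simulate`), the use of signed integers to record gene orientation, and the framing of the permutation by 0 and n+1 when counting breakpoints are all assumptions made for the example.

```python
import random
from typing import List


def random_inversion(perm: List[int], rng: random.Random) -> List[int]:
    """Invert a segment perm[i:j] chosen uniformly among all non-empty segments."""
    n = len(perm)
    # Two distinct cut points among the n + 1 gaps define one segment.
    i, j = sorted(rng.sample(range(n + 1), 2))
    # An inversion reverses the order of the genes and flips their orientation.
    return perm[:i] + [-g for g in reversed(perm[i:j])] + perm[j:]


def count_breakpoints(perm: List[int]) -> int:
    """Count consecutive pairs that are not adjacent in the identity 1..n."""
    framed = [0] + perm + [len(perm) + 1]
    return sum(1 for x, y in zip(framed, framed[1:]) if y - x != 1)


def simulate(n: int, k: int, seed: int = 0) -> List[int]:
    """Apply k uniformly random inversions to the identity permutation on n genes."""
    rng = random.Random(seed)
    perm = list(range(1, n + 1))
    for _ in range(k):
        perm = random_inversion(perm, rng)
    return perm


if __name__ == "__main__":
    genome = simulate(n=100, k=5, seed=1)
    print(count_breakpoints(genome))
```

Since each inversion cuts the permutation at only two points, k inversions create at most 2k breakpoints. Note that every gap between consecutive genes is treated identically here, which is exactly the equal-fragility assumption questioned in the paragraph above.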

We proposed to use simulations that were not designed for validation purposes. This is the case, in artificial life, of in silico experimental evolution [49], and in particular of the Aevol platform [6, 54]. We showed that most combinatorial and statistical estimators of the number of inversions [3, 10, 14, 22, 35, 47, 56, 58] fail on this dataset, although they behaved perfectly on ad-hoc simulations. We argued that biological data is probably closer to the difficult situation. More details of this experimental study and a description of the Aevol platform are given in Chapter 4.

In the last part of this thesis we provided an in-depth examination of the flaws detected. We scrutinized the models and questioned the null hypothesis they adopt. We proposed a switch of the null hypothesis: instead of a uniform weight or probability for all inversions, we designed a model in which breakage probabilities vary across fragile regions and over time. This new model is called INFER (“INversions in FragilE Regions”) and contains as a particular case a corrected translation of the Nadeau-Taylor hypothesis [68] from the nucleotide level to the rearrangement level, that is, fragile regions are broken with a probability proportional to their sizes, in terms of number of nucleotides. In this particular case, the equilibrium distribution of the model resembles the distribution of intergene sizes from diverse organisms.

A second problem arises from the assumption that solid regions are known. In practice, comparing genome organizations begins with preparing homologous loci in different genomes, which can be either a selection of orthologous sets of genes, or synteny blocks made from genes or genomic alignments [78]. However, real fragile regions could lie within such loci, and real solid regions could lie between two consecutive loci. This makes those statistical estimations depend on the arbitrary choices of data preparation. Here we showed that the INFER model can be used for statistical inference with or without the knowledge of the solid and fragile regions, whose number can be estimated, as well as with or without the knowledge of the breakage probabilities, which can be assumed to be distributed according to a Dirichlet law. Indeed, we first showed that the standard identification of genes with solid regions and intergenes with fragile regions yields incoherent results on biological data, and that the estimated number of fragile regions is actually surprisingly small, suggesting that a minority of regions are recurrently used by rearrangements. Finally, we showed that our estimator, designed to take into account the distribution of intergene sizes, performs significantly better on artificial genomes generated by Aevol and, when the number of fragile regions is treated as unknown, is the only one that gives coherent results on biological data. Chapter 5 describes INFER and the estimators derived from it, as well as the results obtained using simulated and biological data.

The results presented in this thesis have been published as follows: the first part of this thesis, corresponding to Chapter 3, was presented at the conference RECOMB-CG 2015 and later published in BMC Bioinformatics [14]; the second part (Chapter 4) was presented at the conference “Computability in Europe” [15]; the last part (Chapter 5) was published in Genome Biology and Evolution [16]. Chapter 6 gives concluding remarks and directions for future work. This study was conducted in collaboration with the Beagle team [51] and Prof. Eric Tannier.


Background

In this chapter we review existing methods for the problem of inferring the number of evolutionary events separating two organisms given their genomes. We proceed in this review by first introducing basic concepts about evolution in genomes, and then a common approach for modelling evolution, which will allow us to present estimators derived from similar mathematical models.

To provide the necessary background for the results presented in this thesis, this chapter is organized in a way to answer the following questions:

• Section 2.1: How do genomes evolve?

The problem studied here arises from evolutionary genomics, a field of biology aiming at better understanding the mechanisms of evolution on genomes. Thus, before focusing on the mathematical aspects of the methods, Section 2.1 gives a brief introduction to the required concepts from biology and the basic terminology.

• Section 2.2: How to measure observable differences caused by evolution?

As it is rare to have access to ancient information such as ancient DNA sequences or fossils, extant genomes are a valuable source for inferring the path taken by evolution. Evolutionary signal can be detected through the differences between related genomes. One way to visualize differences in the arrangement of genomic regions is through a structure called the breakpoint graph, widely used in genome rearrangement studies and explained in Section 2.2.

• Section 2.3: How to model evolution?

A mathematical model of evolution is a simplification of the evolutionary process and tries to explain how evolution could have produced the observed differences in extant genomes. The formulation of a mathematical model makes simplifying assumptions about how evolutionary events affect genomes and with what frequency they occur. In Section 2.3 we describe Markovian models for genome evolution and their relation with other evolutionary processes.

• Sections 2.4 and 2.5: How to estimate the number of evolutionary events separating two genomes?

Ancestral genomic information can be estimated through the observable differences of extant genomes and a model of evolution. Here we focus on method of moments estimators to infer the number of evolutionary events, further detailed in Section 2.4. In Section 2.5 alternative methods are mentioned.

This chapter ends with a summary of the reviewed estimators in Section 2.6, focusing on the ones analysed in this thesis.
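As a schematic reminder of what a method of moments estimator does in this setting, the block below uses notation introduced only for this illustration: $B_k$ is the number of breakpoints after $k$ evolutionary events under the chosen model, $f$ is its expectation as a function of $k$, and $b$ is the value observed between the two extant genomes. It is a generic statement of the principle, not one of the specific formulas reviewed in Section 2.4.

```latex
f(k) = \mathbb{E}\left[ B_k \right],
\qquad
\hat{k} \text{ is chosen such that } f(\hat{k}) = b,
\quad \text{i.e.} \quad
\hat{k} = f^{-1}(b).
```

The estimate is only well defined where $f$ is invertible, which is why the distinction between recursive, numerically invertible, and analytically invertible formulas matters for running time.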

2.1 Genome evolution by rearrangements

The genome of an organism is all the information that determines inherited characteristics, usually carried in DNA molecules (Figure 2.1). Most DNA molecules consist of two strands held together through hydrogen bonding, and made up of simpler units called nucleotides, with four distinct monomers called adenine (A), cytosine (C), guanine (G), and thymine (T).

Figure 2.1: A genome carries all information that determines inherited characteristics. Genetic information is coded by double-stranded DNA, stored in the cell nucleus in the form of chromosomes. Genes contain instructions for assembling proteins, and proteins execute and regulate cellular functions. Source: site “The Science Creative Quarterly” [73], artist: Jiang Long.

The antiparallel orientation is an important characteristic of double-stranded DNA: the two strands are read in opposite directions during DNA processes such as replication and transcription. The order matters during the reading process: the information encoded in the sequence TCAG is different from the information in the sequence GACT.
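As a small illustration of strand orientation, the sketch below computes what is read on the opposite strand of a sequence (its reverse complement) and contrasts it with simply reading the same strand backwards. The function name `reverse_complement` is chosen here for the example.

```python
# Watson-Crick base pairing between the two strands: A pairs with T, C with G.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written in its own reading direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


if __name__ == "__main__":
    print(reverse_complement("TCAG"))  # opposite strand: CTGA
    print("TCAG"[::-1])                # same strand read backwards: GACT
```

Applying the function twice returns the original sequence, which reflects the fact that each strand determines the other.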

A chromosome is a molecule of DNA combined with proteins, called histones, that support its structure. In many eukaryotes, like plants and animals, the chromosomes are linear sequences, and the chromosome ends are called telomeres. Conversely, in bacteria and other prokaryotic organisms, the DNA molecules are typically circular sequences, without ends. A genome is unichromosomal if it contains only one chromosome, and multichromosomal otherwise. Notice that DNA molecules are also stored in organelles, such as mitochondria and chloroplasts.

Evolutionary events modify the genetic composition of organisms, and such changes are inherited by descendants, perhaps being incorporated in species over time. An evolutionary event acts on a genome by breaking the bond between two nucleotides in the DNA sequence, or changing the base of a single nucleotide. Those changes can affect a genome at different scales, ranging from the substitution of a single base in the DNA sequence to the movement of large segments. Here we focus on large-scale evolutionary events, called rearrangements.

The DNA sequence can be “discretized” into regions, which can be oriented, if they carry the reading direction of their strand, or unoriented otherwise. Knowing that breakages inside genes [57] or conserved intergenic regions [65] are very often selected against, for rearrangement studies we can discretize the genome into regions that are unlikely to be broken by rearrangements, called solid, and regions outside solid ones, called fragile.

Solid regions are oriented sequences of DNA, not necessarily of the same size in terms of number of nucleotides. Throughout this thesis we use the term solid regions instead of genes, used in previous studies, since a solid region can also be another kind of syntenic region, or even a single nucleotide, if we want to completely eliminate any chance of rearrangement inside the region. A genome can thus be seen as an ordering of solid regions along chromosomes, where each solid region has an indication of its reading direction.

Rearrangements change how those solid regions are ordered on a chromosome in several ways: they may change the reading direction, the number of chromosomes, or even the genomic content, by including new solid regions, duplicating them, or removing existing ones. The most commonly studied rearrangements include inversions, which reverse the order of a segment of solid regions, as well as their reading directions; translocations, which exchange the telomere-containing segments of two chromosomes; and block interchanges, which swap two segments of solid regions in the same chromosome (transpositions move a block to another place in the same chromosome, which has the same effect as swapping two adjacent segments). Figure 2.2 exemplifies various rearrangement operations observed on biological data and commonly modelled in rearrangement studies.
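The single-chromosome operations just described can be sketched on a chromosome stored as a list of signed integers, one per solid region, with the sign recording the reading direction. This encoding, the half-open index convention, and the function names are assumptions made for the illustration.

```python
from typing import List

Chromosome = List[int]  # signed solid regions; the sign is the reading direction


def inversion(chrom: Chromosome, i: int, j: int) -> Chromosome:
    """Reverse the order of chrom[i:j] and flip the reading direction of each region."""
    return chrom[:i] + [-g for g in reversed(chrom[i:j])] + chrom[j:]


def block_interchange(chrom: Chromosome, i: int, j: int, k: int, l: int) -> Chromosome:
    """Swap the segments chrom[i:j] and chrom[k:l], assuming i <= j <= k <= l."""
    return chrom[:i] + chrom[k:l] + chrom[j:k] + chrom[i:j] + chrom[l:]


def transposition(chrom: Chromosome, i: int, j: int, k: int) -> Chromosome:
    """Move chrom[i:j] to position k (j <= k): a swap of two adjacent segments."""
    return block_interchange(chrom, i, j, j, k)


if __name__ == "__main__":
    reference = [1, 2, 3, 4]
    print(inversion(reference, 1, 3))                # [1, -3, -2, 4]
    print(transposition(reference, 1, 3, 4))         # [1, 4, 2, 3]
    print(block_interchange(reference, 0, 1, 2, 3))  # [3, 2, 1, 4]
```

Writing the transposition as a call to `block_interchange` with an empty gap between the two segments makes the equivalence stated in the text explicit.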

2.1.1 Double-Cut-and-Join (DCJ) and other k-break operations

Instead of modelling a specific event, more general rearrangement operations have been proposed, called k-breaks. A k-break operation makes up to k breaks in a genome and then glues the resulting pieces in a new order, encompassing more than one event observed on biological data.

Double-Cut-and-Join (DCJ) is a mathematical operation introduced a decade ago [96], and it became very popular in rearrangement studies. A DCJ operation is a 2-break operation, i.e., it consists of making a pair of cuts in the chromosomes and recombining the cut ends in a different way. Many standard rearrangements, such as inversions and translocations, can be modelled by 2-breaks. Others, like transpositions (3-breaks) or block interchanges (4-breaks), require more than one DCJ to be performed. Owing to its generality, the DCJ model also encompasses a few less relevant operations, such as circular incisions and excisions, as shown in Figure 2.3.
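A 2-break can be sketched on a genome stored as a set of adjacencies between the extremities of solid regions. The representation below (a tail "t" and a head "h" for each region, adjacencies as unordered pairs, telomeres left implicit) and the names `adj` and `dcj` are assumptions made for this illustration. The sketch only handles cuts at two existing adjacencies, not cuts involving telomeres.

```python
from typing import FrozenSet, Set, Tuple

Extremity = Tuple[int, str]       # (solid region, "t" for tail or "h" for head)
Adjacency = FrozenSet[Extremity]  # unordered pair of neighbouring extremities
Genome = Set[Adjacency]


def adj(a: Extremity, b: Extremity) -> Adjacency:
    """Build the adjacency joining extremities a and b."""
    return frozenset((a, b))


def dcj(genome: Genome,
        cut1: Tuple[Extremity, Extremity],
        cut2: Tuple[Extremity, Extremity]) -> Genome:
    """Cut adjacencies {p, q} and {r, s}, then rejoin them as {p, r} and {q, s}.

    The order in which the caller lists the extremities of each cut selects
    which of the two possible reconnections is made.
    """
    (p, q), (r, s) = cut1, cut2
    old = {adj(p, q), adj(r, s)}
    if len(old) != 2 or not old <= genome:
        raise ValueError("both cuts must be distinct adjacencies of the genome")
    return (genome - old) | {adj(p, r), adj(q, s)}


if __name__ == "__main__":
    # Linear chromosome 1 2 3, all regions read forward.
    genome = {adj((1, "h"), (2, "t")), adj((2, "h"), (3, "t"))}
    # Same two cuts, two reconnections:
    inverted = dcj(genome, ((1, "h"), (2, "t")), ((2, "h"), (3, "t")))  # 1 -2 3
    excised = dcj(genome, ((1, "h"), (2, "t")), ((3, "t"), (2, "h")))   # 1 3 plus circular 2
    print(inverted)
    print(excised)
```

The two calls cut the same pair of adjacencies around region 2. One reconnection gives an inversion of region 2, the other joins regions 1 and 3 and closes region 2 into a circular chromosome.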

2.2 Measuring observable differences: Breakpoint graph

In the methods presented in this overview as well as the methods proposed in this thesis, the given genomes should have exactly the same unique solid regions, that is, no solid


[Figure 2.2 panels: reference genome; reference genome after an inversion containing solid regions 2 and 3; after a transposition of the solid regions 2 and 3; after a block interchange of solid regions 1 and 3; after a translocation of solid regions 6 and 7, and solid region 4; after a fusion of the two chromosomes; after a fission between solid regions 2 and 3; after a duplication of a segment containing the solid region 2; after an insertion of a new segment containing the solid region 8; after a deletion of a segment containing the solid region 2. Panel groups: events that affect only one chromosome; multichromosomal events; events that modify only the region order; events that modify the region order and the content.]

Figure 2.2: Genome rearrangements. A genome is “discretized” into regions and represented by an arrangement of these regions. A rearrangement breaks the genome in one or more places and changes the order of these regions: inversions reverse the order and reading direction of consecutive solid regions; transpositions move a block to another place in the same chromosome; block interchanges swap two segments of solid regions in the same chromosome; translocations break away a piece of a chromosome and attach it to another chromosome, which may cause a change in the reading direction; fusions join two chromosomes into one; fissions are the reverse operation of fusions; duplications copy a region and insert the copy in a new place of the genome; insertions attach a new region to the genome; deletions remove a region from the genome. Other rearrangements are also possible, such as tandem duplications and inverted transpositions.


[Figure 2.3 (image not reproduced). Double-Cut-and-Join (2-break operations) on a genome with solid regions 1, 2 and 3: (1) choose two fragile regions; (2) break the chosen fragile regions; (3) reconnect the broken fragile regions in another way, which yields either an inversion or a circular excision of solid region 2.]

Figure 2.3: Double-Cut-and-Join (DCJ) operation. A DCJ operation is a 2-break operation, i.e., it consists of making a pair of cuts in the chromosomes and recombining the cut ends in a different way. Inversions, translocations, circular incisions, and circular excisions are among the evolutionary events modelled by one DCJ, depending on the places where the cuts are made and how the cut extremities are reconnected.

region is missing or duplicated. Although duplications, insertions and deletions of segments are large-scale mutations, they are not considered in the rearrangement models studied here. To overcome this constraint on biological data, an additional pre-treatment step is usually applied to the genomes to deal with unequal gene content [76].

Looking at a solid region on the DNA strand, each of its extremities is either consecutive to exactly one other extremity or, in the case of a linear chromosome, located at the end of the chromosome and not consecutive to any other extremity. Extremities not adjacent to any other extremity in the genome are called telomeres. An adjacency is a pair of consecutive extremities, read in either direction.

The breakpoint graph is roughly defined as the union of the adjacencies of two genomes defined on the same set of solid regions. It represents the extremities as vertices and connects two vertices with one edge if the corresponding extremities are adjacent in one of the given genomes. Each genome can be seen as a matching, that is, a set of edges no two of which share a vertex; it is a perfect matching if the genome has no telomeres. Thus the breakpoint graph is a union of two matchings, and its components are cycles and paths, or just cycles if the genomes are both circular.

A common adjacency of two genomes is an adjacency present in both, in one reading direction or the other, whereas a breakpoint is an adjacency present in one genome but not in the other. The number of common adjacencies corresponds to the number of cycles with 2 edges in the breakpoint graph (2-cycles), and the number of breakpoints is the same as the number of edges not present in 2-cycles.

The number of 2-cycles provides the same information as the number of common adjacencies, the number of breakpoints, as well as any rearrangement distance defined on them, like the Single-Cut-or-Join distance (SCJ) [39]. However, not all of them are symmetric; for instance, if one of the two genomes has linear chromosomes, the number of different adjacencies may change depending on the genome taken as reference.
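The definitions above can be illustrated with a minimal Python sketch. It assumes circular genomes written as tuples of signed integers, so that each genome is a perfect matching on the extremities and every component of the breakpoint graph is a cycle. This representation and the function names are choices made for the illustration only, not part of the models reviewed here.

```python
def adjacencies(genome):
    """Adjacencies of a circular genome given as a tuple of signed integers.

    Each solid region g has a tail (g, 't') and a head (g, 'h'); a region
    with a negative sign is read from head to tail.
    """
    def first(g):   # extremity met first when reading region g
        return (abs(g), 't' if g > 0 else 'h')

    def last(g):    # extremity met last when reading region g
        return (abs(g), 'h' if g > 0 else 't')

    n = len(genome)
    return {frozenset((last(genome[i]), first(genome[(i + 1) % n])))
            for i in range(n)}


def matching(adjs):
    """Map every extremity to the extremity it is adjacent to."""
    partner = {}
    for adj in adjs:
        u, v = tuple(adj)
        partner[u], partner[v] = v, u
    return partner


def cycle_lengths(genome_a, genome_b):
    """Number of edges of each cycle of the breakpoint graph."""
    in_a = matching(adjacencies(genome_a))
    in_b = matching(adjacencies(genome_b))
    seen, lengths = set(), []
    for start in in_a:
        if start in seen:
            continue
        v, edges = start, 0
        while v not in seen:          # alternate an edge of A and an edge of B
            seen.add(v)
            u = in_a[v]
            seen.add(u)
            v = in_b[u]
            edges += 2
        lengths.append(edges)
    return lengths


a = (1, 2, 3, 4)
b = (1, -3, -2, 4)                    # a after inverting regions 2 and 3
common = adjacencies(a) & adjacencies(b)
breakpoints = adjacencies(a) - adjacencies(b)
lengths = sorted(cycle_lengths(a, b))
```

For these two genomes, which differ by a single inversion, the graph has two 2-cycles (the two common adjacencies) and one cycle with four edges, and two adjacencies of `a` are breakpoints with respect to `b`.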


2.3 Modelling genome evolution with Markovian models

Evolution is generally seen as a stochastic process, in which rearrangements or other mutations happen at random and can eventually become fixed in a population with a probability that depends on the accompanying change in the species’ fitness. Besides, evolution is very generally memoryless, in the sense that inheritance depends solely on the parents and not on the entire series of ancestors. A model that would aim at representing the way genomes evolve should therefore incorporate those characteristics, which makes Markov chains a good candidate to model biological evolution.

Markov chains (or Markov processes) are memoryless stochastic processes. Formally, a stochastic process is a collection X(t) of random variables, where t is typically time, and the Markovian property is defined by (for a discrete-time process):

\[
\Pr\big(X(t+1) = x_{t+1} \mid X(t) = x_t, X(t-1) = x_{t-1}, \ldots, X(1) = x_1, X(0) = x_0\big) = \Pr\big(X(t+1) = x_{t+1} \mid X(t) = x_t\big)
\]

for all states $x_0, x_1, \ldots, x_{t-1}, x_t, x_{t+1}$ of the process and any time $t$. More intuitively, this means that the future of the process (that is, the various states possibly reached and their probabilities of occurrence) depends only on the present state, not on past states (i.e. the pathway followed to reach the current state). Markov processes can be in discrete time, when states are assigned to successive “steps,” or “generations,” or in continuous time, when the time to next event is an exponential random variable. The space of states can be discrete (finite or infinite) or continuous.

To illustrate how Markov processes can be applied to the problem studied, the evolutionary process by rearrangements can be modelled as a finite-state, discrete-time Markov chain, where states are all possible arrangements of solid regions and transitions from one state to another are defined according to the rearrangement operations taken into consideration. For instance, Figure 2.4 shows a model of evolution by inversions for genomes with two solid regions, assuming that all inversions are equally probable to occur in a genome.
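As a concrete illustration of such a chain, the sketch below builds the transition matrix for signed linear genomes, assuming that every inversion of a block of consecutive solid regions is equally probable. Treating a genome and its reverse complement as the same state, and all helper names, are modelling choices made for this sketch only.

```python
from fractions import Fraction
from itertools import permutations, product


def canonical(genome):
    """Pick one representative of a linear genome and its reverse complement."""
    reverse_complement = tuple(-g for g in reversed(genome))
    return min(genome, reverse_complement)


def inversions(genome):
    """Yield the genome obtained by each inversion of a block i..j."""
    n = len(genome)
    for i in range(n):
        for j in range(i, n):
            block = tuple(-g for g in reversed(genome[i:j + 1]))
            yield canonical(genome[:i] + block + genome[j + 1:])


def transition_matrix(n):
    """States and transition probabilities for genomes with n solid regions."""
    states = sorted({canonical(tuple(s * g for s, g in zip(signs, order)))
                     for order in permutations(range(1, n + 1))
                     for signs in product((1, -1), repeat=n)})
    index = {state: i for i, state in enumerate(states)}
    n_inversions = n * (n + 1) // 2   # number of blocks i..j with i <= j
    matrix = [[Fraction(0)] * len(states) for _ in states]
    for state in states:
        for target in inversions(state):
            matrix[index[state]][index[target]] += Fraction(1, n_inversions)
    return states, matrix


states, matrix = transition_matrix(2)
```

With n = 2 this gives four states and three inversions per state, each taken with probability 1/3; inverting both regions at once returns the same genome, so every state keeps itself with probability 1/3.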

As a last remark, before we present related Markovian models, some assumptions have to be made to render our knowledge of molecular biology into a mathematically tractable form:

• Solid regions are known: Although solid regions and genes are often used interchangeably in rearrangement studies, in Chapter 5 we show that on biological data our estimates of the number of fragile regions are surprisingly low, an order of magnitude lower than the number of intergenes, as predicted by [72]. Only a few methods do not rely on this assumption [4, 16].

• Same solid regions: Most methods consider only rearrangements that cause no change in the content, only in the order. Currently there are few methods that model duplications and indels (insertions and deletions). We briefly discuss them in Section 2.5.


[Figure 2.4 (image not reproduced). Four states, each an arrangement of solid regions 1 and 2; the transitions are labelled with the inversion applied (inversion of solid region 1, inversion of solid region 2, or inversion of solid regions 1 and 2), each with probability 0.333...]

Figure 2.4: Example of Markovian model of genome evolution by inversions. Genomes are represented as arrangements of solid regions, and the states are all possible arrangements of solid regions. For each state, there are three possible inversions: either the solid region 1 is inverted, or the solid region 2 is inverted, or both solid regions are inverted and no effect is observed, i.e., the same genome is obtained. This model is based on an assumption that all inversions are equally probable to occur in a genome.

• Uniform distribution of operations: Most models assume that all rearrangements of a certain type have the same probability of occurring, as exemplified in Figure 2.4. However, in Chapter 5 we present a model of evolution by inversions where breakage probabilities vary across fragile regions and over time.

• Markov chain variations: Here we use a finite-state discrete-time Markov chain as an example, but this is only one of the possible ways to model the process. Some models are continuous-time, such as the one studied by Berestycki and Durrett [10], whereas others have infinitely many states, like the model used by York et al. [98], as well as the model presented in Chapter 5.

Markovian models from better-known fields have been used to derive estimates in the context of evolution by rearrangements. In the following we briefly introduce three of these models and their relationship to genome evolution by rearrangements; they are discussed in detail in Chapter 3.

2.3.1 Evolution in molecular sequences

A lot of effort has been put into modelling the evolution of DNA or protein sequences [24, 44]. Many different Markov models of sequence evolution have been proposed in the literature and applied to sequence data. Caprara and Lancia [22] applied the same reasoning used in evolutionary models on sequences to a model of evolution by unsigned inversions, that is, a model in which the solid regions do not have an orientation. Later, other rearrangement models based on molecular evolution also yielded estimates that are easy to compute [14, 93].


The evolution of a DNA sequence of length n is represented by n independent Markovian processes, identically distributed and constant in time. Each of these processes takes values in {A, C, G, T}, the four distinct nucleotides (or bases) that compose the DNA sequence. The DNA sequence evolves in time by the process of base replacement, where the transition probability defines the probability of replacing a nucleotide by another one. The first proposed Markov model of DNA sequence evolution, called the Jukes-Cantor model [53], assumed a constant rate for every possible change. A Jukes-Cantor-like model on a binary alphabet can be defined for our context: each site corresponds to a possible adjacency, with a 1 in a sequence if the corresponding adjacency is present in the associated genome, and a 0 otherwise. The Markov chain has as state space all possible such sequences, with equiprobable substitutions at any site.

The difference between these processes is that rearrangements do not evolve the sequence site by site, but perform substitutions at several sites simultaneously. For instance, 2-break operations, such as inversions, translocations, or DCJs, break two adjacencies and rearrange them in a new way, affecting 4 sites at the same time: two sites change from 1 to 0, whereas two other sites change from 0 to 1.
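The simultaneous change of four sites can be checked on a small example. The following sketch assumes a circular genome written as a tuple of signed integers and uses one site per unordered pair of extremities; the encoding and the function names are illustrative assumptions.

```python
from itertools import combinations


def adjacencies(genome):
    """Adjacencies of a circular genome given as a tuple of signed integers."""
    def first(g):   # extremity met first when reading region g
        return (abs(g), 't' if g > 0 else 'h')

    def last(g):    # extremity met last when reading region g
        return (abs(g), 'h' if g > 0 else 't')

    n = len(genome)
    return {frozenset((last(genome[i]), first(genome[(i + 1) % n])))
            for i in range(n)}


def adjacency_vector(genome):
    """Binary sequence with one site per possible adjacency."""
    extremities = sorted((abs(g), end) for g in genome for end in 'ht')
    present = adjacencies(genome)
    return [int(frozenset(pair) in present)
            for pair in combinations(extremities, 2)]


def invert(genome, i, j):
    """Reverse the order and the signs of regions i..j (0-based, inclusive)."""
    block = tuple(-g for g in reversed(genome[i:j + 1]))
    return genome[:i] + block + genome[j + 1:]


before = adjacency_vector((1, 2, 3, 4, 5))
after = adjacency_vector(invert((1, 2, 3, 4, 5), 1, 2))
lost = sum(b == 1 and a == 0 for b, a in zip(before, after))
gained = sum(b == 0 and a == 1 for b, a in zip(before, after))
```

In this example the inversion of regions 2 and 3 turns two sites from 1 to 0 and two other sites from 0 to 1, while the remaining sites of the 45-site sequence are unchanged.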

2.3.2 Evolution in the symmetric group

Eriksen and Hultman [35] proposed, as an analogy to genomes evolving by inversions, a Markov chain on the symmetric group, where permutations evolve by random transpositions. Notice that here the term “transposition” is not related to the rearrangement operation; it denotes a permutation that swaps two elements. The analogy was also noted in the definition of an algebraic model for genome rearrangements [40].

A permutation α is a bijection of a given set E into itself, that is, α : E → E, and defines one or more orbits. The orbit of an element a ∈ E under α is the subset of E given by

\[
\mathrm{orb}_\alpha(a) = \{a, \alpha(a), \alpha^2(a), \ldots\} = \{\alpha^n(a) \mid n \in \mathbb{Z}\},
\]

where $\alpha^2(a)$ is the same as $\alpha(\alpha(a))$. The identity permutation is the permutation which maps each element into itself, and it has as orbits all subsets of E with only one element. When α is a permutation of a finite set E, we can visualize the orbits of α using cycles. A permutation is a cycle if it has exactly one orbit containing two elements or more. The transposition, mentioned at the beginning of this section, is a cycle with 2 elements in its biggest orbit.

Every permutation of a finite set is a composition of disjoint cycles, that is, an element appears in the biggest orbit of exactly one cycle. The effect of a transposition on the composition of disjoint cycles is the following: if the two swapped elements belong to the same cycle, then this cycle is broken into two disjoint cycles, whereas if the elements are in separate cycles, the two cycles are joined into one.
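The splitting and joining effect can be observed directly. In the sketch below a permutation of {0, ..., n − 1} is stored as a list of images, and a transposition is composed with it by swapping two images; both conventions, and the function names, are assumptions of this illustration. Fixed points are counted as orbits with one element.

```python
def orbits(perm):
    """Orbits of a permutation given as a list with perm[i] = image of i."""
    seen, result = set(), []
    for start in range(len(perm)):
        if start in seen:
            continue
        orbit, x = [], start
        while x not in seen:
            seen.add(x)
            orbit.append(x)
            x = perm[x]
        result.append(orbit)
    return result


def apply_transposition(perm, a, b):
    """Compose perm with the transposition that swaps a and b."""
    result = list(perm)
    result[a], result[b] = result[b], result[a]
    return result


identity = [0, 1, 2, 3]
p1 = apply_transposition(identity, 0, 1)   # 0 and 1 in separate orbits: join
p2 = apply_transposition(p1, 1, 2)         # 1 and 2 in separate orbits: join
p3 = apply_transposition(p2, 0, 2)         # 0 and 2 in the same orbit: split
counts = [len(orbits(p)) for p in (identity, p1, p2, p3)]
```

Each transposition changes the number of orbits by exactly one, downwards when it joins two orbits and upwards when it splits one.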

In the genome evolution context, inversions affect the components of the breakpoint graph in a similar way. Then, an analogy can be stated by using the identity permutation as the genome that is the starting point, and applying successive transpositions to the identity permutation, as inversions are applied to the genome. More precisely, the cycles of the breakpoint graph of two genomes are identified with cycles of the permutation obtained from the identity by a series of k transpositions.

2.3.3 Evolution in random graphs

In their famous paper called “On the evolution of random graphs”, Erdős and Rényi [31] mention applications of their framework to communication networks, but surprisingly not to evolution, as their title might have suggested. The link to rearrangements was first pointed out by Berestycki and Durrett [10], who related genomes evolving by inversions to a widely studied subject in graph theory.

The popular Erdős and Rényi model [31] for generating random graphs is a process that starts with n vertices and no edges, and at each step adds one new edge chosen uniformly from the set of missing edges. The analogy can be stated with 2-break operations, such as inversions, translocations, or DCJs, where adjacencies of the initial genome are identified with vertices in the empty graph, and a rearrangement affecting two adjacencies relates to an edge connecting the corresponding vertices. Note that parallel edges are allowed, so the model is not exactly Erdős and Rényi’s, but most parameters evolve in the same way.

When a vertex is connected to more than one vertex, it means that the adjacency is being re-used in the evolutionary process. In fact, there are interesting statistical properties of random graphs that shed light on questions of rearrangement studies. For instance, the moment when the emergence of cycles in random graphs becomes highly likely is related to the moment when genome evolution probably applies rearrangements that undo the effect of previous ones; likewise, the appearance of giant components in random graphs is related to the saturation of the genome evolutionary process, as after a high number of rearrangements small cycles are joined into one big cycle in the breakpoint graph.
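A small simulation can make the random graph process tangible. The sketch below assumes the variant described above, in which each step adds an edge between two distinct vertices chosen uniformly and parallel edges are allowed; the function name and the union-find bookkeeping are choices made for the illustration.

```python
import random


def random_graph_process(n, k, rng):
    """Add k random edges to n isolated vertices; parallel edges are allowed.

    Returns the number of connected components after k steps and the first
    step at which an edge closed a cycle (None if that never happened).
    """
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    components, first_cycle = n, None
    for step in range(1, k + 1):
        u, v = rng.sample(range(n), 2)      # two distinct vertices
        root_u, root_v = find(u), find(v)
        if root_u == root_v:                # edge inside a component
            if first_cycle is None:
                first_cycle = step
        else:                               # edge merging two components
            parent[root_u] = root_v
            components -= 1
    return components, first_cycle


components, first_cycle = random_graph_process(100, 60, random.Random(0))
```

As long as no edge has closed a cycle, every step merges two components, so the number of components is exactly n − k; once cycles appear, the number of components decreases more slowly than the number of steps grows.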

2.4 Method of moment estimators for genome evolution

On biological data, we have only the observable outcomes of the process of evolution, and the number of evolutionary events is unknown. Based on a stochastic model of the evolutionary process, we can then derive statistical estimators for the number of events applied. Three of the most popular methods of estimation are the method of moments, the method of maximum likelihood, and the Bayesian method. In this thesis we restrict our attention to estimators based on the method of moments, although other approaches are also possible and are discussed further in Section 2.5.

The first part of an estimation by the method of moments consists of computing an expected value for a parameter D after a fixed number k of rearrangements applied to a genome. In other words, D is a real-valued random variable whose distribution is parametrized by k; as a result, the expected value E[D] is a function f of k. Not all methods in the literature are able to compute the expected value E[D] exactly: they provide either heuristic, approximated, or exact solutions, as we will see in Section 2.4.2.


In our context, the parameter D is any measure computed from the two given genomes, such as the number of breakpoints (B) or the number of cycles (C) in the breakpoint graph. We discuss how different measures can affect an estimate in Section 2.4.1.

The rearrangements and their respective probabilities of occurrence are defined according to the model of evolution chosen. The model can consider only one kind of event or a combination of them, and it can be imported from better known fields, as seen in Section 2.3.

A method of moments estimator based on the parameter D is an estimator k̂ = f⁻¹(d), where d is the observed value for the mean of D. In our problem, since there is only one observation for D, the mean of D is simply the observed value of D. The function f(k) can be inverted algebraically or numerically, and the way it is inverted directly affects the running time of the estimator, making it computationally faster or slower, as discussed in Section 2.4.3.

As the parameter D is related to the observed differences between the genomes, its domain of values is bounded according to the genome size, that is, the number of solid regions, whereas the value of k can be arbitrarily large. Thus, the estimates of k tend to underestimate the true value, by an amount that grows quickly as k grows. More precisely, considering genomes with n solid regions, there are three separate regimes in the behaviour of a statistical estimator [10, 14]:

1. k ≤ n/2: when the number of events is low compared to the genome size, it is expected that k corresponds to the length of the most parsimonious scenarios to transform the initial genome into the final one;

2. n/2 < k ∈ O(n log n): during this phase it is expected that the evolutionary process takes a convoluted path to its end state, possibly even undoing earlier changes along the way;

3. k ∈ Ω(n log n): after such a number of events the arrangement of solid regions is likely randomised [22]. This is called saturation, and it means that after this number of rearrangements, there will be no signal in breakpoints or in the breakpoint graph to retrieve any part of the evolutionary distance k. The problem is essentially insurmountable, as the variance of any estimate will be huge.

Figure 2.5 depicts the concepts of method of moments estimators presented here, which are developed in the following subsections, along with a review of the existing estimators based on this approach.

2.4.1 Observable parameter: Breakpoints and cycles

Given two genomes with n solid regions, the observable parameter D is any measure that can be computed from the observable differences in the arrangement of solid regions.

Most estimators use parameters related to the number of 2-cycles in the breakpoint graph, such as the number of common adjacencies or the number of breakpoints, to give an estimate of the number of events applied, as in the pioneering work of Caprara and Lancia (CL) [22]. CL uses the number of breakpoints to estimate the real number of unsigned reversals applied to a genome. Wang and Warnow used the CL idea to propose Approx-IEBP (the Inverse of the Expected number of BreakPoints) [93], an approximation of the expected number of breakpoints after k signed inversions. More recently, Lin and Moret (LM) [58] proposed a heuristic for the expected amount of common/different adjacencies, and also the amount of common/different telomeres after a certain number of Double-Cut-and-Join (DCJ) operations [96].

[Figure 2.5 (image not reproduced). Problem: estimate the number of evolutionary events (inversions) separating two extant genomes when the evolutionary scenario is unknown; in the example the number of events is k = 4 and the estimated number is k̂ = 2. Observable parameter (D): number of cycles (C) or number of breakpoints (B). Model of evolution: a simplified view of how rearrangements take place, shown as a transition matrix over all arrangements of solid regions; inversions break the genome in 2 places, so with 5 possible places to cut the total number of inversions is $\binom{5}{2} = 10$, and, assuming that all inversions have the same probability of being chosen, an entry is 1/10 when one inversion leads state i to state j. Method of moment estimator: (1) E[D] = f(k): given a model of evolution, computes the expected value of an observable parameter D after k evolutionary events, either heuristically, approximately, or exactly; (2) inverts the function f(k) to obtain an estimate k̂ based on the value of the observable parameter, either analytically, numerically, or recursively.]

Despite the simplicity of computing measures based on 2-cycles, they are affected by how genomes are segmented into solid regions. For instance, when genes are identified as solid regions, sequences of genes appearing in the same order may be collapsed in a single element, modifying the proportions of common and different adjacencies with respect to the total number of adjacencies.

The more components of the breakpoint graph are taken into consideration, the less susceptible the measure is to the resolution problem, since collapsing common adjacencies changes only the number of 2-cycles. For instance, some distances, such as the inversion distance (ID) [47] and the DCJ distance (DCJ) [96], are based on the total number of cycles in the breakpoint graph. However, it seems more complicated to use such measures as the observable parameter. Regarding genome evolution by inversions, we found only a handful of statistical estimators using the total number of cycles as the observable parameter, and none of them computes its expected value exactly after a certain number of random inversions: Eriksen and Hultman (EH) [35] compute an approximation through an exact formula for the expected Cayley distance after k swaps on the symmetric group, and Berestycki and Durrett (BD) [10] compute an approximation using random graphs. Concerning genome evolution by adjacent transpositions, Eriksson et al. [36] provided a closed formula for the expected inversion distance, later improved by Eriksen [33], but still costly to compute.

An idea recently proposed by Alexeev and Alexeyev [3] tries not only to avoid the resolution problem, but also to infer the correct number of solid and fragile regions. Their estimator (here denoted AA) takes more information into account than the number of breakpoints, and assumes that $n$ is unknown. In addition to the number of breakpoints $B$, the number of cycles with four vertices $C_2$ (also referred to as squares of adjacencies) and the number of cycles with six vertices $C_3$ are used to estimate the number of 2-breaks ($k_2$), the number of 3-breaks ($k_3$), and the number of solid regions ($n$):

Estimator AA: Alexeev and Alexeyev [3]

\[
\begin{aligned}
n - E[B] &\approx e^{-\gamma} n \\
E[C_2] &\approx e^{-2\gamma} k_2 \\
E[C_3] &\approx e^{-3\gamma} \left( k_3 + \frac{2 k_2^2}{n} \right),
\end{aligned}
\qquad \text{where } \gamma = \frac{2 k_2 + 3 k_3}{n}. \tag{AA}
\]

Besides AA, ER2 is another estimator which considers the number of solid regions unknown, and it is explained in Chapter 5.


Despite the intuition that breakpoint graphs carry more information than breakpoints, we observed in our experiments (see Chapter 3) that estimators based on 2-cycles had the same performance as estimators based on the number of cycles, in agreement with the experimental results obtained by Eriksen and Hultman [35]. However, this observation does not seem extensible to other rearrangement problems: in the median problem, there are experimental studies pointing out the opposite behaviour [67].

2.4.2 Estimation quality: Exact, approximated, and heuristic solutions

The definition of “quality” applied here is suitable only for estimators based on the method of moments. In this specific context, an exact estimator computes precisely the expected value of an observable parameter D after a certain number of operations k, whereas an approximated estimator provides some guarantee about how close the solution is to the expectation. Otherwise, if there are no guarantees, the estimator is heuristic. Approximated and heuristic solutions can be useful either when the evolutionary model considered is quite complicated or when efficiency is more important than accuracy.

Exact approaches

Using the analogy from molecular sequence evolution, a genome with n regions, oriented or not, is seen as a binary sequence S with n sites (or n+1 if the genome is linear). At each step one or more substitutions occur, that is, if at one site there is a 1, the substitution turns it into 0, and conversely, with all substitutions equiprobable.

Inspired by molecular evolution, the idea here is to look at rearrangements at a local level, first finding the expectation for a breakpoint between a pair of consecutive solid regions in a genome, and then summing up these expectations to obtain the expected total number of breakpoints. Let $S_k$ be a sequence obtained after $k$ steps of such a process. The probability $P_k$ that one site is different in $S$ and $S_k$ can be computed from $P_{k-1}$ by

\[
P_k = P_{k-1} q_u + (1 - P_{k-1}) p_s = P_{k-1} (q_u - p_s) + p_s, \tag{2.1}
\]

where $p_s$ is the probability of changing a site given that it is the same in $S$ and $S_{k-1}$, and $q_u$ is the probability of not changing back a site when it is different. It is possible to solve the recurrence in order to obtain a closed formula depending on $p_s$ and $q_u$. Regardless of the model, since $P_0 = 0$, we have

\[
P_k = P_{k-1} (q_u - p_s) + p_s = p_s \, \frac{(q_u - p_s)^k - 1}{q_u - p_s - 1}. \tag{2.2}
\]

Notice that the probability of observing a breakpoint is the same, regardless of which adjacency is analysed. Thus, the expected number $D_k$ of sites that have a different value in $S$ and $S_k$ can be computed by the formula

\[
D_k = n P_k \tag{2.3}
\]

for a sequence with $n$ sites, since the probabilities of changing sites are independent from each other.
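The agreement between the recurrence (2.1) and the closed form (2.2) can be checked with exact rational arithmetic. The sketch below is only a sanity check; the function names and the values chosen for the two probabilities are arbitrary assumptions, and the closed form requires that the difference between the two probabilities is not 1.

```python
from fractions import Fraction


def p_recursive(k, ps, qu):
    """P_k obtained by iterating recurrence (2.1) from P_0 = 0."""
    p = Fraction(0)
    for _ in range(k):
        p = p * (qu - ps) + ps
    return p


def p_closed(k, ps, qu):
    """P_k obtained from the closed form (2.2); requires qu - ps != 1."""
    ratio = qu - ps
    return ps * (ratio ** k - 1) / (ratio - 1)


ps, qu = Fraction(1, 5), Fraction(9, 10)      # arbitrary example values
values = [(p_recursive(k, ps, qu), p_closed(k, ps, qu)) for k in range(25)]
```

Both functions return the same fraction for every k, which is the geometric sum behind (2.2).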

The idea above provides the foundations of several estimators. Caprara and Lancia (CL) [22] derived an exact formula for the evolution by inversions in the case where solid regions do not have an orientation. Setting $p_s = (n - 2) / \binom{n}{2}$, as $n - 2$ inversions can separate an adjacency, and $q_u = 1 - 2 / \binom{n}{2}$, as only two inversions undo a breakpoint, the expected number of breakpoints after $k$ unsigned inversions is

Estimator CL: Caprara and Lancia [22]

\[
E[B_k] = (n - 1) \left( 1 - \left( \frac{n - 3}{n - 1} \right)^{k} \right). \tag{CL}
\]

Other estimators (Approx-IEBP [93], gDCJ [14]) also use equations 2.1–2.3 to obtain closed formulas for different rearrangement models, varying only the probabilities $p_s$ and $q_u$ according to the operations applied.
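Because the CL formula is a closed expression in k, it can be inverted with a logarithm. The sketch below shows this inversion under the assumptions that n > 3 and that the observed number of breakpoints is below the saturation value n − 1; the function names are hypothetical.

```python
import math


def cl_expected_breakpoints(n, k):
    """Expected number of breakpoints after k unsigned inversions (CL)."""
    return (n - 1) * (1 - ((n - 3) / (n - 1)) ** k)


def cl_estimate(n, b):
    """Invert the CL formula: estimate k from b observed breakpoints."""
    if n <= 3 or not 0 <= b < n - 1:
        raise ValueError("need n > 3 and 0 <= b < n - 1 (below saturation)")
    return math.log(1 - b / (n - 1)) / math.log((n - 3) / (n - 1))


k_hat = cl_estimate(100, 40)   # estimate for 40 breakpoints among 100 regions
```

When the observation reaches n − 1 the logarithm is undefined, which is the saturation phenomenon: no finite k has that expected value.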

Approximated approaches

Concerning the problem of estimating the expected number of breakpoints after a certain number of inversions, a complicating factor is the orientation of solid regions. Unlike CL, most estimators tackle the harder version, in which inversions affect both order and reading direction. Although a few exact estimates have been proposed, they all lack a closed formula.

Here we see an approximated approach for the oriented version, called Approx-IEBP [93]. Based upon a sequence-like model, Approx-IEBP uses equations 2.1–2.3 described earlier: the probability $p_s$ of creating a breakpoint at a certain position by an inversion can be easily determined, whereas the probability $q_u$ of not undoing a breakpoint at a certain position by an inversion depends on the orientation of the two solid regions.

Suppose that two consecutive regions in the initial genome have the same orientation. If, in the current genome, the two regions have different orientations and are separated, there is exactly one inversion that undoes the breakpoint; otherwise, if they are separated but have the same orientation, it is impossible to undo the breakpoint with a single inversion. Thus the probability of undoing a breakpoint lies between 0, when undoing is impossible, and 1 over the total number of inversions, when undoing is always possible. The average between the lower and upper bounds enables one to derive an approximation for the probability of a certain adjacency being a breakpoint after k inversions. The error of Approx-IEBP is the sum of the errors for each adjacency. Notice that CL’s assumption that the solid regions do not have orientations simplifies the computation of both $p_s$ and $q_u$.

Heuristic approaches

Heuristic estimators do not provide any theoretical guarantees about accuracy. Nonetheless, they may exhibit good performance in experimental studies. One of these approaches is EDE (Empirically Derived Estimator) [66], for which the idea is basically to find a function that fits the normalized curve relating the number of inversions performed to the number of inversions observed. The normalized curve was obtained through simulations of genome evolution by inversions, and it was similar regardless of the size of the genome (37 or 120 genes).

Another heuristic estimator was proposed by Lin and Moret [58], based on an evolutionary model where one DCJ is performed at a time, and all possible $m^k$ operations at step $k$ have the same probability of being chosen. The heuristic method computes the expected value of four related observable parameters: the number of common ($s_A^k$) and different ($d_A^k$) adjacencies, and the number of common ($s_T^k$) and different ($d_T^k$) telomeres, after $k$ DCJs.

Given a genome with $n_A$ adjacencies and $n_T$ telomeres, the expected values of the observable parameters after $k$ DCJs are computed as follows:

Estimator LM: Lin and Moret [58]

\[
\begin{aligned}
E[s_A^k] &= E[s_A^{k-1}] + \frac{1}{E[m^{k-1}]} \Big( n_A - 2 E[s_A^{k-1}] \big( E[A^{k-1}] + E[T^{k-1}] \big) \Big) \\
E[s_T^k] &= E[s_T^{k-1}] + \frac{1}{E[m^{k-1}]} \Big( n_T \big( E[T^{k-1}] + 1 \big) - 2 E[s_T^{k-1}] \big( E[A^{k-1}] + E[T^{k-1}] \big) \Big) \\
E[d_A^k] &= E[d_A^{k-1}] + \frac{1}{E[m^{k-1}]} \Big( 2 E[s_A^{k-1}] \big( E[A^{k-1}] + E[T^{k-1}] \big) + E[T^{k-1}]^2 - E[A^{k-1}] - n_A \Big) \\
E[d_T^k] &= E[d_T^{k-1}] + \frac{1}{E[m^{k-1}]} \Big( 2 E[s_T^{k-1}] \big( E[A^{k-1}] + E[T^{k-1}] \big) - n_T \big( E[T^{k-1}] + 1 \big) - 2 E[T^{k-1}]^2 + 2 E[A^{k-1}] \Big),
\end{aligned}
\tag{LM}
\]

where the total number of operations $m^k$ depends on the number of adjacencies $A^k = s_A^k + d_A^k$ and the number of telomeres $T^k = s_T^k + d_T^k$. LM gives neither the exact expected values after $k$ DCJs nor any guarantee on the quality of the estimation, because the formulas plug in the expected values from previous steps, and the expectation of their combination is not trivial to determine since the random variables are not independent. In Chapter 3 we present an exact algebraic formula for an evolutionary model by DCJs.

2.4.3 Running time: Recursive, numerically invertible, and analytically invertible formulas

After obtaining a function f(k) to compute the expected value of an observable parameter D after k events, this function needs to be inverted in order to obtain an estimate k̂ for an observation of D.

However, the function f(k) is not always easily invertible. In terms of computational effort, an estimate is computed either recursively, or through a numerically invertible formula, or, in the best case, using an analytical formula. Here our review goes from the least efficient approaches to the most efficient ones; this does not necessarily mean that efficient approaches are better, as in most cases there is a trade-off between an accurate solution and a fast method.
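When no analytical inverse is available, a monotone f can still be inverted numerically. The sketch below uses plain bisection and assumes that f is increasing in k with f(0) not larger than the observation; the function names, the bracket growth strategy, and the use of the CL formula as the example f are choices made for this illustration.

```python
def invert_increasing(f, d, k_max=1e9, iterations=200):
    """Solve f(k) = d for k >= 0, assuming f is increasing with f(0) <= d."""
    lo, hi = 0.0, 1.0
    while f(hi) < d:                 # grow the bracket until f(hi) >= d
        hi *= 2
        if hi > k_max:
            raise ValueError("d is never reached by f (saturated observation)")
    for _ in range(iterations):      # bisection on [lo, hi]
        mid = (lo + hi) / 2
        if f(mid) < d:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def expected_breakpoints(k, n=100):
    """CL formula, used here only as an example of an increasing f."""
    return (n - 1) * (1 - ((n - 3) / (n - 1)) ** k)


k_hat = invert_increasing(expected_breakpoints, expected_breakpoints(30))
```

Each evaluation of f costs one call, so the total cost is the number of bisection steps times the cost of f; this is the sense in which a numerically invertible formula is slower than an analytical one but does not require a closed-form inverse.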

Recursive formula

In certain cases the estimators cannot be expressed as a closed formula; instead, they are computed in a recursive way, that is, to compute the estimate k̂, all estimates between 1 and k − 1 should be computed.

Besides LM, another example of a recursive formula is the one proposed by Wang and Warnow, who devised an exact estimator, called Exact-IEBP [92], for the case where solid regions have an orientation and evolve by inversions. However, the formula is computationally costly compared to other solutions for the same problem. Estimators based on sequence evolution usually define the probabilities of creating and undoing a breakpoint at a certain position ($p_s$ and $q_u$), but as we saw in Section 2.4.2, it is hard to define the probability of undoing a breakpoint in the case of signed inversions, since it depends on the orientation of solid regions in the given genomes.

In order to obtain exact estimates, something more detailed is needed in place of $p_s$ and $q_u$. Wang and Warnow replaced the parameters for the presence or absence of a breakpoint with the parameter $L(G') \in \{\pm 1, \pm 2, \ldots, \pm(n - 1)\}$, whose absolute value gives the distance in the genome $G'$ between two regions that are consecutive in the initial genome $G$, and whose sign (positive or negative) indicates the reading direction. Thus, $L(G') = x$ means that $|x| - 1$ solid regions in $G'$ separate a certain pair of solid regions that are consecutive in $G$: $L(G') = 1$ corresponds to a common adjacency of $G$ and $G'$, whereas any other value corresponds to a breakpoint, assuming that the two consecutive regions initially have the same orientation.

A Markov chain is defined on the $2n - 2$ states of $L(G')$. The transition between two states occurs according to the proportion of events whose effect is to turn the initial state into the ending state. Exact-IEBP uses the Generalized Nadeau-Taylor model [93]; that is, besides inversions, transpositions and inverted transpositions can also be applied. Each type of rearrangement has an associated probability, and once the type of event is chosen, all operations of this type have the same chance of being performed.

Given the transition matrix $M$, where $M[u, v]$ is the probability that a rearrangement changes $L(G')$ from $u$ to $v$, the expected number of breakpoints after $k$ inversions is

Estimator Exact-IEBP—Wang and Warnow [93]

\[
E[B_k] = n \Pr\big(L(G_k) \neq 1\big) = n\big(1 - M^k[1, 1]\big). \tag{Exact-IEBP}
\]

In addition to the difficulty of inverting the function, Exact-IEBP requires the computation of $M^k$, which is costly and becomes intractable as the genome size increases.
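To make the cost of the matrix power concrete, the sketch below evaluates $n(1 - M^k[1, 1])$ by repeated squaring in plain Python. The matrix `toy_m` is a made-up symmetric stochastic matrix with $2n - 2 = 4$ states for $n = 3$; it is a placeholder and not the actual Exact-IEBP transition matrix. The state $L = 1$ is assumed to be stored at index 0.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(m, k):
    """m raised to the non-negative integer power k by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in m]
    while k > 0:
        if k & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        k >>= 1
    return result


def expected_breakpoints(m, n, k):
    """n * (1 - M^k[1, 1]), with the state L = 1 stored at index 0."""
    return n * (1.0 - mat_pow(m, k)[0][0])


# Placeholder transition matrix (hypothetical values, rows sum to 1).
toy_m = [
    [0.4, 0.2, 0.2, 0.2],
    [0.2, 0.4, 0.2, 0.2],
    [0.2, 0.2, 0.4, 0.2],
    [0.2, 0.2, 0.2, 0.4],
]
print(expected_breakpoints(toy_m, 3, 2))
```

With this schoolbook multiplication, each product of two matrices with $2n - 2$ rows takes a number of scalar operations cubic in $n$, and repeated squaring needs on the order of $\log k$ such products.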

Numerically invertible formula

Not all closed formulas can be easily inverted; in such cases, the equation $f(k) = d$ can be solved numerically using standard techniques such as the Newton-Raphson method.
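A minimal sketch of such a numerical inversion is given below. It solves $f(k) = d$ by Newton-Raphson iterations, estimating the slope by central differences. The function `saturating_curve` is a hypothetical stand-in for $f$ chosen only for illustration, and the names, tolerance, and step size are likewise assumptions.

```python
def newton_invert(f, d, k0=0.0, tol=1e-9, max_iter=100, h=1e-6):
    """Solve f(k) = d for k by Newton-Raphson with a central-difference slope."""
    k = k0
    for _ in range(max_iter):
        residual = f(k) - d
        if abs(residual) < tol:
            return k
        slope = (f(k + h) - f(k - h)) / (2.0 * h)
        if slope == 0.0:
            raise ZeroDivisionError("flat slope: Newton-Raphson cannot proceed")
        k -= residual / slope
    raise RuntimeError("Newton-Raphson did not converge")


def saturating_curve(k, n=100):
    # Hypothetical f(k): increases with k and levels off at n.
    return n * (1.0 - (1.0 - 2.0 / n) ** k)


# Estimate the number of events for an observed value d = 40.
k_hat = newton_invert(saturating_curve, 40.0)
print(k_hat)
```

For an increasing concave $f$ such as this one, starting at $k = 0$ makes the iterates approach the solution from below. In general, Newton-Raphson needs a reasonable starting point, and a bracketing method such as bisection is a safer fallback.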

Interesting examples can be taken from Eriksen and Hultman [35], and Berestycki and Durrett [10], already mentioned in Section 2.3. These estimators have a closed formula for the number of cycles in a permutation after k transpositions, which is closely related to the number of cycles in a breakpoint graph after k inversions.

Eriksen and Hultman (EH) [35] proposed an exact formula for the expected number of cycles in a permutation obtained from the identity permutation of size n by a series of k random transpositions:


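Estimator EH—Eriksen and Hultman [35]

\[
E[C_k] = n - \sum_{i=1}^{n} \frac{1}{i} + \sum_{p=1}^{n-1} \sum_{q=1}^{\min\{p,\, n-p\}} a_{pq} \left( \frac{\binom{p}{2} + \binom{q-1}{2} - \binom{n-p-q+2}{2}}{\binom{n}{2}} \right)^{k}, \tag{EH}
\]

where

\[
a_{pq} = (-1)^{n-p-q+1} \, \frac{(p - q + 1)^2}{(n - q + 1)^2 (n - p)} \binom{n-p-1}{q-1} \binom{n}{p}.
\]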

Berestycki and Durrett (BD) [10] use an analogy with the evolution of random graphs to derive an approximation for the number of cycles in a permutation of n elements after k transpositions through the formula

Estimator BD—Berestycki and Durrett [10]

\[
E[C_k] = \sum_{i=1}^{\infty} \frac{n}{2k} \, \frac{i^{\,i-2}}{i!} \left( \frac{2k}{n} \, e^{-2k/n} \right)^{i}. \tag{BD}
\]

Observe that, in the permutation context, EH gives an exact solution whereas BD gives an approximation whose error is bounded by $O(\sqrt{n})$. In the genome evolution context, however, both compute an approximation for the expected number of cycles. These approximations are formalized in Chapter 3.

Analytically invertible formula

Some estimators $\hat{k} = f^{-1}(d)$ can be computed directly. They are given by simple closed formulas with running time $O(1)$. This is the case of an approximation given by Eriksen [32], as an extension of Exact-IEBP.

Eriksen observed that the transition matrix $M$, used in Exact-IEBP for the process of evolution by inversions, has the important property of being symmetric. A symmetric matrix can be diagonalised, and its eigenvalues $\lambda_i$ and eigenvectors $v_i$ are particularly well behaved. Eriksen then wrote the exact equation of Exact-IEBP in terms of eigenvectors and eigenvalues, found the two largest eigenvalues and their respective eigenvectors, and, by keeping only these two terms in the formula, derived a fast closed formula that approximates the exact one:

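Estimator NE—Eriksen [32]

\[
E[B_k] = n \left( 1 - \frac{\sum_{i=1}^{2n-2} v_i^2 \lambda_i^k}{\binom{n}{2}^{k}} \right) \approx n \left( 1 - \frac{1}{2n - 2} \right) \left( 1 - \left( 1 - \frac{2}{n} \right)^{k} \right). \tag{NE}
\]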
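Solving the approximation for $k$ gives $\hat{k} = \log\!\big(1 - b / (n(1 - \tfrac{1}{2n-2}))\big) / \log(1 - \tfrac{2}{n})$ for an observed number of breakpoints $b$. The sketch below implements both directions. The function names are illustrative, and the input checks ($n \geq 3$, $b$ below the saturation value) are assumptions added so that the logarithms are defined.

```python
import math


def ne_expected_breakpoints(n, k):
    """Approximate expected number of breakpoints after k inversions (NE)."""
    return n * (1.0 - 1.0 / (2 * n - 2)) * (1.0 - (1.0 - 2.0 / n) ** k)


def ne_estimate_k(n, b):
    """Closed-form inverse of the NE approximation for an observed b."""
    if n < 3:
        raise ValueError("n must be at least 3 so that log(1 - 2/n) is defined")
    ceiling = n * (1.0 - 1.0 / (2 * n - 2))   # value approached as k grows
    if not 0 <= b < ceiling:
        raise ValueError("b must lie in [0, n * (1 - 1/(2n - 2)))")
    return math.log(1.0 - b / ceiling) / math.log(1.0 - 2.0 / n)


# Round trip: 30 inversions on n = 100 regions, then back to an estimate.
b = ne_expected_breakpoints(100, 30)
print(b, ne_estimate_k(100, b))
```

The inverse uses a fixed number of arithmetic operations, independent of $k$. Observations at or above the saturation value have no finite estimate under this approximation, which is why they are rejected here.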