INSTITUTO DE COMPUTAÇÃO

João Paulo Pereira Zanetti

Methods for Phylogenetic Reconstruction

Métodos de Reconstrução Filogenética

CAMPINAS

2016


Methods for Phylogenetic Reconstruction

Métodos de Reconstrução Filogenética

Tese apresentada ao Instituto de Computação da Universidade Estadual de Campinas como parte dos requisitos para a obtenção do título de Doutor em Ciência da Computação.

Dissertation presented to the Institute of Computing of the University of Campinas in partial fulfillment of the requirements for the degree of Doctor in Computer Science.

Supervisor/Orientador: Prof. Dr. João Meidanis

Este exemplar corresponde à versão final da Tese defendida por João Paulo Pereira Zanetti e orientada pelo Prof. Dr. João Meidanis.

CAMPINAS

2016


Ficha catalográfica

Universidade Estadual de Campinas

Biblioteca do Instituto de Matemática, Estatística e Computação Científica
Ana Regina Machado - CRB 8/5467

Zanetti, João Paulo Pereira,

Z16m  Methods for phylogenetic reconstruction / João Paulo Pereira Zanetti. – Campinas, SP : [s.n.], 2016.

Orientador: João Meidanis.

Tese (doutorado) – Universidade Estadual de Campinas, Instituto de Computação.

1. Algoritmos. 2. Biologia computacional. 3. Programação dinâmica. 4. Árvore PQR. I. Meidanis, João, 1960-. II. Universidade Estadual de Campinas. Instituto de Computação. III. Título.

Informações para Biblioteca Digital

Título em outro idioma: Métodos de reconstrução filogenética
Palavras-chave em inglês:
Algorithms
Computational biology
Dynamic programming
PQR trees
Área de concentração: Ciência da Computação
Titulação: Doutor em Ciência da Computação
Banca examinadora:

João Meidanis [Orientador]

Maria Emília Machado Telles Walter
Cleber Valgas Gomes Mira

Zanoni Dias

Ulisses Martins Dias

Data de defesa: 18-07-2016

Programa de Pós-Graduação: Ciência da Computação


João Paulo Pereira Zanetti

Methods for Phylogenetic Reconstruction

Métodos de Reconstrução Filogenética

Banca Examinadora:

• Prof. Dr. João Meidanis (IC / Unicamp)
• Profa. Dra. Maria Emília Machado Telles Walter (CiC / UnB)
• Prof. Dr. Cleber Valgas Gomes Mira (UEMS)
• Prof. Dr. Zanoni Dias (IC / Unicamp)
• Prof. Dr. Ulisses Martins Dias (IC / Unicamp)

A ata da defesa com as respectivas assinaturas dos membros da banca encontra-se no processo de vida acadêmica do aluno.


Acknowledgements

Many people helped make this thesis become a reality, and I want to gratefully acknowledge all of them here.

I am very grateful to my parents, Janete and João, who always supported and encouraged me, not only during this research, but through my entire life.

Many thanks also to my partner Luísa and to my friends either in Brazil, Canada, or elsewhere. Each one of you who walked by my side during this journey is responsible for some of this thesis.

I would like to express my gratitude towards my advisor, João Meidanis, for all the knowledge, wisdom, guidance, and advice provided to me. Also, to Priscila Biller, with whom I shared most of the struggles and joys of this work.

I would like to thank Cedric Chauve, for welcoming me to a new country and to a new research group, together with Ashok Rajaraman and Yann Ponty, who also worked beside me.

I am deeply indebted to the University of Campinas and to Simon Fraser University, and to all the professors and staff I met in both of them. Both universities are great places to study and work, and they inspired me to grow a lot as a person and as a researcher.

Finally, I would like to thank the funding agencies CAPES and FAPESP (processes 2012/13865-7 and 2013/07868-6) for the financial support.


Resumo

Reconstrução filogenética é o campo da Biologia Computacional dedicado a inferir um histórico evolutivo a partir de dados biológicos do presente. Nesta tese de doutorado, apresentamos três trabalhos sobre diferentes problemas relacionados ao tema.

O primeiro visa reconstruir o ancestral comum a três genomas de entrada. Apesar da entrada aparentemente limitada, este problema pode ser aplicado para determinar ancestrais em árvores de topologia já conhecida. Nosso trabalho introduz uma forma de representar genomas através de matrizes. A partir disto, definimos o problema da mediana de matrizes, que, dadas três matrizes A, B e C, consiste em encontrar uma matriz M que minimize a soma d(A, M) + d(B, M) + d(C, M), onde d(X, Y) é o posto da matriz Y − X. Para resolver tal problema, apresentamos um algoritmo aproximado que garante resultados no mínimo tão bons quanto um dos genomas de entrada, mas que, em experimentos simulando a evolução de genomas, teve resultados muito mais próximos do ótimo. Além disso, mostramos também uma heurística que traduz uma matriz qualquer, como a resultante do nosso algoritmo, de volta para um genoma.

No segundo, o objetivo é reconstruir genomas ancestrais a partir dos históricos evolutivos de suas famílias de genes. Para isto, estendemos o esquema de programação dinâmica do software DeCo, que constrói florestas retratando a evolução de adjacências. Nossa implementação, em vez de escolher apenas as adjacências de uma solução mais parcimoniosa, explora todo o espaço de soluções, escolhendo adjacências em função da sua maior frequência. Com esta abordagem, conseguimos reduzir significativamente o número de inconsistências ao avaliar múltiplas instâncias sobre os mesmos genes.

Finalmente, discutimos uma ferramenta capaz de auxiliar na reconstrução de genomas ancestrais, considerando os genomas de espécies descendentes e sem a necessidade dos históricos de todas as famílias de genes. Esta ferramenta é a estrutura de dados chamada árvore PQR, que não só detecta se a entrada tem a propriedade dos uns consecutivos, como também indica os obstáculos que possivelmente impedem a instância de ter a propriedade. Isto é importante porque, ao reconstruir genomas ancestrais, erros na entrada são muito comuns e podem muito facilmente impedir a instância de ter a propriedade dos uns consecutivos. Aqui, consolidamos o conhecimento sobre os algoritmos online e offline para árvores PQR em um único artigo.

Com isto, avançamos em três diferentes direções o conhecimento nas áreas de Reconstrução Filogenética e de Biologia Computacional.


Abstract

Phylogenetic reconstruction is the field of computational biology dedicated to inferring evolutionary history from extant biological data. In this PhD thesis, we present three works on different issues related to the topic.

The first aims to reconstruct the common ancestor of three input genomes. Despite the apparently limited input, this problem can be applied to determine ancestors in trees whose topology is already known. Our work introduces a way to represent genomes using matrices. From this, we define the matrix median problem, which consists of, given three matrices A, B, and C, finding a matrix M that minimizes the sum d(A, M) + d(B, M) + d(C, M), where d(X, Y) is the rank of the matrix Y − X. To solve this problem, we present an approximation algorithm that ensures results at least as good as one of the input genomes but that, in experiments simulating the evolution of genomes, got much closer to optimal results. Furthermore, we also show a heuristic that translates any matrix, like the ones returned by our algorithm, back to a genome.

In the second, the goal is to reconstruct ancestral genomes from the evolutionary histories of their respective gene families. For this, we extend the dynamic programming scheme of DeCo, which builds forests depicting the evolution of adjacencies. Our implementation, instead of choosing just the adjacencies of one most parsimonious solution, explores the entire solution space, choosing adjacencies according to their frequency. Using this approach, we managed to significantly reduce the number of inconsistencies while evaluating multiple instances on the same genes.

Finally, we discuss a tool to assist in the reconstruction of ancient genomes, considering the genomes of descendant species, without the need for the histories of all the gene families. This tool is the data structure called the PQR-tree, which not only detects whether the input has the consecutive ones property, but also indicates the obstacles that possibly prevent the instance from having said property. This is important because, while reconstructing ancient genomes, input errors are very common and can very easily prevent the instance from having the consecutive ones property. Here, we consolidate the knowledge on the online and offline algorithms for PQR-trees in a single paper.

With these, we moved knowledge in phylogenetic reconstruction and computational biology forward in three different directions.


Contents

1 Introduction
   1.1 Reconstruction in terms of medians
   1.2 Reconstructing genome adjacencies
   1.3 Reconstructing ancient genomes

2 The Matrix Median Problem
   2.1 Introduction
   2.2 Linear algebra background
       2.2.1 Vector spaces
       2.2.2 Matrices
       2.2.3 Permutations
   2.3 The Matrix Median Problem
       2.3.1 From genomes to matrices
       2.3.2 Matrix Distance
       2.3.3 Relationship to known distances
       2.3.4 Matrix Median Problem
       2.3.5 Partitioning R^n
       2.3.6 Computing Median Candidates
       2.3.7 Approximation Factor
   2.4 Matrix to genome heuristic
   2.5 Experimental results
       2.5.1 Implementation
       2.5.2 Validation tests
       2.5.3 Simulated evolution tests
   2.6 Conclusions

3 Evolution of genes neighborhood
   3.1 Background
   3.2 Methods
       3.2.1 Models
       3.2.2 Algorithms
   3.3 Results and discussion
   3.4 Conclusions

4 PQR-trees
   4.1 Introduction
   4.2 PQ-trees and PQR-trees
   4.3 PQR-tree reduction
       4.5.1 Overlap graph
       4.5.2 Twin classes and node types
       4.5.3 Building the tree
   4.6 Conclusion

5 Discussion and conclusions
   5.1 Future work


Introduction

We give the name phylogenetic reconstruction to the inference of evolutionary history from extant information. This task usually results in a representation of evolutionary history in the form of a phylogenetic tree, whose nodes represent biological entities, most commonly species, genes, or genomes. The branches (edges) represent the relationship of immediate ancestry between two entities, and these branches can have associated lengths, usually related to the time elapsed, or to the number of mutations that occurred. The phylogenetic tree depicts the process of evolution from a common ancestor to all the extant entities positioned at the leaves. An example is illustrated in Figure 1.1.

Phylogenetic trees are an oversimplification of the evolutionary history and its underlying biological mechanisms. It is impossible to depict in one figure all the complexity of the evolutionary process. Many phenomena that today's science ascribes to the evolution of life forms can be very hard to represent in a phylogenetic tree or very difficult to infer exclusively from extant data. We give a few examples below.

Convergent evolution can disturb the reconstruction of the evolution scenario. A trait can evolve in parallel in two separate branches of the real tree, and suggest that the two branches are more closely related than they really are.

Lateral (or horizontal) gene transfer is the transfer of genes that does not happen through reproduction, for instance, genes carried by vectors such as viruses or bacteria. In a phylogenetic tree, this means that genetic information moves laterally from one branch to another, independent branch. This specific transfer phenomenon is widely studied, and taking it into account can make a computationally easy problem become much harder.

Figure 1.1: Example of a phylogenetic tree for primates.

Another phenomenon that connects different branches of a tree is hybridization. A hybrid is an organism that comes from the breeding of two different species. This would lead to a node with two parents, which is impossible in a tree.

However, even though phylogenetic trees might be unable to represent some biological details of the evolution [71, 79], they are a valuable tool to approximate true evolution, and to study specific mechanisms [43, 44, 59].

Phylogenetic trees date from the 19th century, and one of their earlier mentions was in Darwin's works [9, 10, 34]. Through the years, the phylogenetic tree became the accepted way of representing evolution. In the 1960s, the first computational methods to infer phylogenies from extant data were presented, for example the theories proposed by Edwards and Cavalli-Sforza [39, 40], and Camin and Sokal [20]. From then on, the field of computational phylogenetics rapidly expanded, especially with the advent of DNA sequencing at the end of the 20th century. We can group the methods for phylogenetic inference into three main families.

One approach to phylogenetic reconstruction is based on distances. In this approach, there is some kind of metric dened as the distance between two entities. For example, a distance can be based on mismatches in the pairwise alignment of sequences, or it can be a rearrangement distance. Once all pairwise distances are dened and known, there are a variety of methods for inferring a tree such that entities separated by smaller distances are also closer in the tree.

Another way to reconstruct a potential tree is based on maximum parsimony. Here, the search is for a tree that minimizes the number of certain evolutionary events or, in a more general setting, minimizes a cost function.

A third approach to reconstruct a phylogeny is based on statistical techniques. These methods work by assigning a probability to each tree, and then using methods to search for a tree with the maximum probability (or likelihood).

This thesis comprises three articles related to problems in phylogenetic reconstruction. The rest of this chapter goes into more detail about their backgrounds. Chapter 2 presents a way to reconstruct the common ancestor of three genomes in the distance-based paradigm. Chapters 3 and 4 deal with reconstructing ancestral genomes in a parsimony-based setting. Chapter 3 uses the evolutionary histories of multiple gene families to infer ancestral adjacencies. In Chapter 4, we discuss a data structure called the PQR-tree, which can assist in the reconstruction of ancient genomes without the need for the history of each gene family. In Chapter 5, we discuss our conclusions and future work.

1.1 Reconstruction in terms of medians

Mutations in a genome can be divided into two major types, according to their size. The most common is the substitution, insertion, or deletion of one or a few nucleotides at a point in the genome. These small-scale mutation events are caused by errors in DNA replication, are relatively common, and a single species can accumulate a large number of them.


Figure 1.2: Examples of genome rearrangements on linear chromosomes. The depicted events are reversal, transposition, deletion, insertion, fusion, fission, and translocation.

The second and potentially more impactful kind of mutation involves the movement, duplication, or deletion of large segments of the genome at once. These are caused by errors in cell division. They are called genome rearrangements and usually have dramatic effects on the organism. Figure 1.2 illustrates examples of rearrangement events.

The analysis of genome rearrangements was pioneered by Dobzhansky and Sturtevant, who published in 1938 a rearrangement scenario using only inversions between Drosophila pseudoobscura and Drosophila miranda [36].

From then on, the field evolved, and the study of these kinds of rearrangement scenarios turned into combinatorial problems. In the distance problem, a set of allowed rearrangement events, called a rearrangement model, is fixed, and the goal is to compute the most parsimonious scenario between two given genomes, that is, how to mutate one into the other using only the events allowed by the model, in the minimum number of evolutionary operations.

A large number of rearrangement models are studied and, in many of them, the pairwise distance is easy to compute [42, 53, 82]. However, when applying these distances and rearrangement models to phylogenetic reconstruction problems with more than two genomes, they soon become intractable.

We focus on one of the seemingly simplest problems involving more than two genomes, the genome median problem (GMP). The genome median problem consists in finding a genome that minimizes the sum of its rearrangement distances to three input genomes. That is, we are given three genomes α, β, and γ as the input, and a metric d() associated with a rearrangement model, and we want to find a genome µ that minimizes d(µ, α) + d(µ, β) + d(µ, γ). Even though it deals with only three input genomes, the GMP is NP-Hard under most rearrangement models [19, 23, 45, 68].

The GMP is very useful because it can be applied to more general phylogenetic reconstruction problems, like the small phylogeny problem. In the small phylogeny problem, the input is a phylogenetic tree and the extant genomes (the leaves of the tree), and the goal is to reconstruct all the ancestral genomes (the internal nodes) so that the total number of events is minimized. Some methods for the small phylogeny problem repeatedly apply the GMP until the tree converges [18, 66, 74].

Our method models genomes as matrices in a simple way. The distance between two genomes is then defined as the rank distance of their respective matrices, that is, given two genomes α and β, and their respective matrices A and B, the distance d(α, β) = d(A, B) is the rank of the matrix B − A. We show how this matrix model relates to two known rearrangement models, the algebraic distance [42] and the Double-Cut-and-Join distance [82]. With the matrix model for genomes, we introduce the matrix median problem, which consists of, given three matrices A, B, and C, finding a matrix M that minimizes the sum d(A, M) + d(B, M) + d(C, M). We present an approximation algorithm for this problem. Our algorithm ensures results at least as good as one of the input genomes; however, in the experiments simulating genome evolution, it got results much closer to optimal. Note that the median candidates returned by the algorithm are not necessarily genomes. To translate them back to genomes, we show a heuristic based on a maximum matching, also with very good experimental results.
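As a concrete illustration of the rank distance and the median score, the following NumPy sketch (a toy example of ours, not code from the thesis) encodes two tiny genomes as permutation matrices, a simplified instance of the matrix model:

```python
import numpy as np

def rank_distance(A, B):
    """Rank distance between matrices A and B: the rank of B - A."""
    return np.linalg.matrix_rank(B - A)

def total_score(M, A, B, C):
    """Median score of a candidate M against three input matrices."""
    return rank_distance(M, A) + rank_distance(M, B) + rank_distance(M, C)

# Toy instance: two identical genomes (identity) and one that swaps
# the first two elements.
A = np.eye(3)
B = np.eye(3)
C = np.array([[0., 1., 0.],
              [1., 0., 0.],
              [0., 0., 1.]])

print(rank_distance(A, C))     # 1: C - A has rank 1
print(total_score(A, A, B, C)) # 1: the corner A scores 0 + 0 + 1
```

In this tiny instance the corner A is itself a median: by the triangle inequality, any candidate M satisfies d(M; A, B, C) ≥ (d(A, B) + d(A, C) + d(B, C)) / 2 = 1, which A attains.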

1.2 Reconstructing genome adjacencies

One of the classical problems in phylogenetic reconstruction is gene tree reconciliation [49]. In this problem, we are given two phylogenetic trees. One is the species tree S, which describes the evolutionary history of a group of extant species at the leaves, tracing back to a single ancestor at the root. The second is the gene tree G, which describes the evolution of a gene family, from a single ancestral gene, through a number of duplication and speciation events (not yet labeled on the tree), to the extant genes that appear in the extant species of S.

The goal is to fit G into the evolutionary history told by S. This is done by labeling the nodes of G with the species where they occur and the kind of event they represent: duplication or speciation. Extra branches might also be inserted in the tree to represent gene losses. In the output reconciled gene tree, the internal nodes represent ancestral genes and the leaves represent either the extant genes from the input or lost genes. The desired reconciled tree is one that minimizes the number of duplications and/or losses.

Bérard et al. expanded this parsimony-based approach from genes to adjacencies with the DeCo algorithm [13]. It takes as input two already reconciled gene trees and a set of extant adjacencies. The output is an adjacency forest that minimizes the number of adjacency gains or breaks. The adjacency forest tells the evolutionary story of the ancestral adjacencies, through the aforementioned events of gain and break, but also of speciation, gene duplication, and gene loss.

Like most algorithms to reconstruct parsimonious evolutionary scenarios along a species tree, DeCo uses a dynamic-programming (DP) scheme to efficiently compute a parsimonious adjacency forest for the given input. However, when computing adjacency scenarios for many pairs of gene trees along the same species tree, a large number of inconsistencies appear.

These inconsistencies are ancestral genes taking part in more than two adjacencies. This suggests that, although the forest returned by DeCo is optimal, it might not be the best explanation for the evolution of the considered adjacencies.

Our approach addresses this issue by exploring the whole solution space of DeCo's DP scheme, instead of accepting one arbitrary solution. We perform this exploration using two techniques. The first method is to sample a large number of co-optimal or sub-optimal scenarios at once, and accept only the adjacencies that appear the most. The second approach is to use an inside-outside algorithm to directly compute a probability for each adjacency, without the use of sampling.

The probability distribution used is the Boltzmann distribution, which allows us to control the sampling through the value chosen for a constant called kT. Lower values of kT skew the distribution towards the most parsimonious solutions, while increasing the value of kT makes the probabilities more uniform over all solutions, independently of their parsimony score.
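The role of kT can be sketched in a few lines; the scenario costs below are hypothetical, and exp(-cost/kT) is the standard Boltzmann weight over parsimony scores:

```python
import math

def boltzmann_probs(costs, kT):
    """Boltzmann probability of each scenario given its parsimony cost."""
    weights = [math.exp(-cost / kT) for cost in costs]
    total = sum(weights)  # the partition function
    return [w / total for w in weights]

costs = [3, 3, 5, 8]  # hypothetical parsimony scores of four scenarios

# Low kT: the two co-optimal scenarios (cost 3) get almost all the mass.
print(boltzmann_probs(costs, kT=0.5))

# High kT: the distribution is nearly uniform, regardless of cost.
print(boltzmann_probs(costs, kT=50.0))
```

Sampling scenarios in proportion to these probabilities is what lets the method favor parsimonious adjacency forests without committing to a single arbitrary optimum.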

Both approaches managed to lower the number of inconsistencies observed. Furthermore, the best results were observed when the Boltzmann distribution was biased towards the co-optimal solutions and the solutions have relatively low cost. This suggests that parsimony is indeed a relevant criterion for adjacency gains and breaks.

1.3 Reconstructing ancient genomes

Chauve and Tannier, in 2008, published a framework for reconstructing ancient genomes. The approach is centered around a property of binary matrices called the consecutive ones property (C1P) and the use of data structures called PQ and PQR-trees [25].

Their framework works by selecting groups of genome segments that appear consecutively in multiple descendants of the species whose genome we want to reconstruct. These are candidates for segments that also appear together in the ancient genome. Once the candidate groups are defined, the goal is to determine a sequence of segments where all the selected groups appear consecutively. This is an application for the C1P.

The C1P problem is a classic problem in combinatorics and has applications in many different areas of knowledge, for example recognizing interval graphs [46] and planar graphs [17], archeology [56], and data visualization [32]. It can be stated as follows.

Given a set U and a family S of subsets of U, the consecutive ones problem consists in finding a permutation of the elements of U such that every S ∈ S appears consecutively.

In bioinformatics, one textbook application of the C1P is physical mapping [2, 26]. This is the problem of reconstructing a DNA strand given a collection of overlapping fragments of it called clones. First, a number of probes are identified in the genome. Then, clones are produced from this genome. These clones are contiguous fragments of the original genome, which are then matched with the probes in a way that we know which probes are in each clone, but not their order.

From this point on, we have a standard C1P instance. The probes are the elements of U, and the sets of probes in each clone make up the constraints in S.
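For intuition, the property can be checked by brute force on tiny instances (a sketch of the problem statement only; practical algorithms, such as Booth and Lueker's PQ-tree construction, run in linear time):

```python
from itertools import permutations

def has_c1p(universe, subsets):
    """Brute-force consecutive ones check: is there an ordering of
    `universe` in which every set in `subsets` occupies consecutive
    positions?  Exponential in |universe|; for illustration only."""
    for order in permutations(universe):
        position = {element: i for i, element in enumerate(order)}
        if all(max(position[x] for x in s) - min(position[x] for x in s)
               == len(s) - 1 for s in subsets):
            return True
    return False

print(has_c1p({1, 2, 3, 4}, [{1, 2}, {2, 3}, {3, 4}]))  # True: order 1,2,3,4
print(has_c1p({1, 2, 3}, [{1, 2}, {2, 3}, {1, 3}]))     # False: three pairwise
                                                        # adjacencies cannot fit
                                                        # in a 3-element order
```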

With the advances in genome sequencing and assembly technology, this specific approach became outdated, but methods based on it are still used in applications where the original DNA might not be available, like the ancient genomes found in the problem of Chauve and Tannier.

In their framework, genomes are represented as sequences of markers. The reconstruction occurs in three steps. The first is to detect sets of markers that should be contiguous in the ancestral genome. The candidates for such groups are groups of markers that are contiguous in two or more extant species that descend from the considered ancestor in the species tree. These candidate groups are then further refined considering a synteny conservation model.

The second step takes the groups of contiguous markers from the rst step and uses them as the input instance of the C1P, building a PQR-tree.

The PQ-tree is a structure first introduced by Booth and Lueker in 1976. It can be built efficiently and compactly represents all the permissible permutations for its input [17]. Through its two types of nodes, P and Q, it is possible to produce all permissible permutations from just one tree.

The PQR-tree is a generalization of the PQ-tree that manages to have a meaningful output even when the input instance does not have a permissible permutation (the case where there is no PQ-tree). The PQR-tree introduces a third kind of node, of type R, that indicates the elements that are obstacles to the C1P [64].

Going back to Chauve and Tannier's framework, if the output tree T of the second step is a PQ-tree, then there are no ambiguities and T represents the possible sequences for the ancestral genome. However, if there is an R-node in the output, it is a sign that there are false positives among the groups of the first step, and a third step tries to clear them.

PQ-trees and PQR-trees are also used in other applications of the C1P in phylogenetic reconstruction and in comparative genomics [1, 12, 77].


Median Approximations for Genomes Modeled as Matrices

Abstract

The Genome Median Problem is an important problem in phylogenetic reconstruction under rearrangement models. It can be stated as follows: given three genomes, find a fourth that minimizes the sum of the pairwise rearrangement distances between it and the three input genomes. In this paper, we model genomes as matrices, and study the matrix median problem using the rank distance. It is known that, for any metric distance, at least one of the corners is a 4/3-approximation of the median. Our results allow us to compute up to three additional matrix median candidates, all of them with approximation ratios at least as good as the best corner, when the input matrices come from genomes. We also show a class of instances where our candidates are optimal. From the application point of view, it is usually more interesting to locate medians farther from the corners, and therefore these new candidates are potentially more useful. In addition to the approximation algorithm, we suggest a heuristic to get a genome from an arbitrary square matrix. This is useful to translate the results of our median approximation algorithm back to genomes, and it has good results in our tests. To assess the relevance of our approach in the biological context, we ran simulated evolution tests and compared our solutions to those of an exact DCJ median solver. The results show that our method is capable of producing very good candidates.

Based on João Paulo Pereira Zanetti, Priscila Biller, and João Meidanis. Median Approximations for Genomes Modeled as Matrices. Bulletin of Mathematical Biology, 2016. DOI: 10.1007/s11538-016-0162-4.


2.1 Introduction

Phylogenetic reconstruction involves obtaining a phylogenetic tree from extant information and inferring information about ancestral genomes. One of the fundamental concepts used to achieve this is a measure of the lengths of tree branches, or similarity between two genomes, through a distance metric.

Such metrics are usually easy to compute pairwise, with linear or subquadratic algorithms to determine the distance between two points. On the other hand, evolutionary scenarios involving more than two genomes are generally much more difficult.

One of the most basic problems in phylogeny reconstruction is the genome median problem (GMP): given three genomes, find a fourth genome that minimizes the sum of its pairwise distances to the other three. The GMP is NP-Hard in most unichromosomal rearrangement models [19, 23, 45, 68]. Exceptions are found in the multichromosomal domain, where simple metrics such as the breakpoint distance [78] and the Single-Cut-or-Join (SCJ) model [41] result in polynomially solvable problems.
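For illustration, one textbook variant of the breakpoint distance, restricted to unsigned linear unichromosomal genomes, can be computed directly (a sketch under simplifying assumptions; the definition used in [78] also handles signed markers and multiple chromosomes):

```python
def breakpoint_distance(a, b):
    """Count adjacencies of permutation `a` (unordered pairs of neighbors)
    that are not adjacencies of permutation `b`."""
    adjacencies_b = {frozenset(pair) for pair in zip(b, b[1:])}
    return sum(frozenset(pair) not in adjacencies_b for pair in zip(a, a[1:]))

print(breakpoint_distance([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # 0
print(breakpoint_distance([1, 2, 3, 4, 5], [1, 4, 3, 2, 5]))  # 2: the reversal
                                                              # of segment 2..4
                                                              # breaks {1,2} and {4,5}
```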

In the vast majority of the models where the GMP is NP-Hard, exact algorithms can still be practical for smaller instances [4, 23, 81, 84], while heuristics are used for larger instances [18, 48, 72]. Also, many approximation algorithms have been developed [21, 22, 69].

The GMP is a particularly useful problem, because several algorithms for phylogeny reconstruction are based on repeatedly solving GMP instances, until convergence is reached. Software tools such as the pioneering BPAnalysis [74], the more recent GRAPPA [66], and MGR [18] use this technique.

Here, we show a simple way to model genomes as matrices. With this, we define the distance between two genomes as the rank distance of their respective matrices. We also show how the matrix model relates to known rearrangement models, e.g. the algebraic distance [42] and the well-known Double-Cut-and-Join (DCJ) distance [82].

The main goal of this chapter is to investigate the problem of computing the matrix median of three genomes. The median problem can be stated as follows: given three matrices A, B, and C, and a metric d(M, N) equal to the rank r(N − M), find a matrix M that minimizes the total score d(M; A, B, C), defined as

d(M ; A, B, C) = d(M, A) + d(M, B) + d(M, C).

We show that the matrix median problem can be approximated quickly. Although the solutions are not always genomes, we also show a heuristic to translate the matrix solutions back into genomes, using a maximum weight matching in a graph, with good experimental results. This positive result can help shed more light on the genome median problem, by leading to approximation solutions, or to special cases that can be solved polynomially in the genome setting.

It is known that, for any distance satisfying the axioms of a metric, at least one of the corners is a 4/3-approximation of the median [75]. Our results allow us to compute up to three additional matrix median candidates, all of them with approximation ratios at least as good as the best corner, when the input matrices come from genomes. Although the 4/3-approximation factor is tight, there is a class of input instances for which our candidates are guaranteed to be actual medians, and on instances generated by simulating genome evolution our candidates are also much closer to the median. Also, in real applications, it is necessary to locate medians farther from the corners, otherwise they neither approximate the true ancestor nor carry information about all input genomes [52].

The rest of this chapter is organized as follows. In Section 2.2, we review some linear algebra definitions used in this work. In Section 2.3, we show how genomes can also be seen as matrices, define the matrix median problem, show our results, and propose an algorithm. In Section 2.4, we present a heuristic to compute genomes from matrices like the ones we get from the approximation algorithm. In Section 2.5, we discuss the algorithm implementation and show experimental results. Finally, in Section 2.6, we present our conclusions.

This chapter is a revised version of the work presented at WABI 2013 [83].

2.2 Linear algebra background

In this section, we review concepts in linear algebra that are important for the matrix median approach we propose.

2.2.1 Vector spaces

A vector space is a set V with two operations, vector addition and scalar multiplication, satisfying the properties of commutativity, associativity, distributivity, additive identity and inverse, and multiplicative identity. An example of a vector space, and the one that matters the most for us, is the n-dimensional real space R^n of all n-tuples of real numbers, with the addition and scalar multiplication of vectors defined componentwise.

We will consider the norm of a vector in R^n to be its Euclidean norm. That is, for x = (x1, x2, ..., xn), its norm ||x|| is the Euclidean distance from the origin to x, or

||x|| = sqrt(x1^2 + x2^2 + ... + xn^2).

For two vectors in R^n, we say they are orthogonal to each other if their inner product equals zero. The inner product (or dot product) of two vectors u = (u1, u2, ..., un) and v = (v1, v2, ..., vn) in R^n is <u, v> = u · v = Σ_{i=1}^{n} u_i v_i. For example, in R^3, (1, 0, 0) and (0, 1, 0) are orthogonal, while (1, 1, 1) is not orthogonal to either of them.

A basis of a vector space V is a linearly independent set that spans the whole vector space. This means that any vector in V can be expressed as a linear combination of the basis vectors, but no element of the basis can be expressed as a linear combination of the others. All the bases for V have the same number of elements; this number is the dimension of V. For example, one basis for R^3 is the set {(1, 0, 0), (0, 1, 0), (0, 0, 1)}. We call this specific set the canonical basis of R^3. We say a basis is orthonormal when all its elements are pairwise orthogonal, and the norm of each element equals 1.

A nonempty subset W of V that is closed under the same operations of addition and scalar multiplication that V has is also a vector space. We say that W is a subspace of V. In this work, special interest is dedicated to the subspaces of R^n. In R^3, for example, the subspaces are the origin, the lines and planes through the origin, and R^3 itself.


Two subspaces U and W of a vector space V are orthogonal if each vector in U is orthogonal to each vector in W. The set of all vectors in V that are orthogonal to all vectors of a subspace W is the orthogonal complement of W, denoted W^⊥. In R^3, the orthogonal complement of a plane is a line perpendicular to it, crossing the plane at the origin.

Finally, the direct sum of two vector spaces A and B is another vector space, denoted A ⊕ B, that contains the pairs (a, b) where a ∈ A and b ∈ B. For instance, R^2 = R ⊕ R. Given two subspaces U and W of V, the vector space V is the direct sum of U and W, that is, V = U ⊕ W, if U ∩ W = {0} and, for every v ∈ V, there are unique vectors u ∈ U and w ∈ W such that v = u + w.

2.2.2 Matrices

Because our method is based on matrices, a few more concepts related to them are necessary. First, it is important to point out that, from here on, we represent the vectors of R^n as column matrices. For example, the vector (1, 2, 3) of R^3 is represented as the 3 × 1 matrix [1 2 3]^T.

The rank r(M) of a matrix M is the maximum number of linearly independent columns (or rows) of M. The rank of an n × m matrix is at most min(n, m). Also, r(M) = 0 if and only if all entries of M are 0. We show two examples below:

r( [ 1 0 0 ]          r( [ 1 2 3 ]
   [ 0 1 0 ] ) = 3,      [ 3 2 1 ] ) = 2.
   [ 0 0 1 ]             [ 4 4 4 ]

In the first example, all three rows are linearly independent, therefore the rank is 3. In the second example, the third row is the sum of the other two, so the rank is 2.
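For small integer matrices, the rank can be computed exactly by Gaussian elimination over the rationals. The sketch below is ours, not part of the original text (the function name `rank` is our choice); it reproduces the two examples above.

```python
from fractions import Fraction

def rank(M):
    """Rank of a matrix, via Gaussian elimination over the rationals."""
    M = [[Fraction(x) for x in row] for row in M]
    rows, cols = len(M), len(M[0])
    r = 0  # current pivot row
    for c in range(cols):
        # find a pivot in column c, at or below row r
        pivot = next((i for i in range(r, rows) if M[i][c] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        # eliminate column c from every row below the pivot
        for i in range(r + 1, rows):
            f = M[i][c] / M[r][c]
            M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

print(rank([[1, 0, 0], [0, 1, 0], [0, 0, 1]]))  # 3
print(rank([[1, 2, 3], [3, 2, 1], [4, 4, 4]]))  # 2
```

Working over `Fraction` avoids the numerical-tolerance issues of floating-point rank computations on these small examples.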

The image im(M) of a matrix M is the space of all vectors that can be obtained by multiplying M by a vector of R^n. That is, given an m × n matrix M, we have

im(M) = {Mv | v ∈ R^n}.

A related concept is that of the kernel, or null space, of a matrix M. It is the space of vectors that are mapped by M to the zero vector. That is, given an m × n matrix M with real coefficients, its kernel is

ker(M) = {v ∈ R^n | Mv = 0}.

2.2.3 Permutations

In Section 2.3.3 we discuss the algebraic adjacency theory proposed by Feijão and Meidanis [42], and its relationship to our matrix model. Since the algebraic distance uses permutations, we briefly review permutation theory in the sequel.

Given a set E, a permutation α : E → E is a bijective map from E onto itself. Permutations are represented as parenthesized lists, with each element followed by its image, and the last element's image is the first element in the list. For instance, on E = {a, b, c}, α = (a b c) is the permutation that maps a to b, b to c, and maps c back to a. This representation is not unique; (b c a) and (c a b) are equivalent. Two or more permutations are disjoint when their sets of unfixed points are disjoint. Permutations are composed of one or more disjoint cycles. For instance, the permutation α = (a b c)(d e)(f) has three cycles. A cycle with k elements is called a k-cycle. A 1-cycle represents a fixed element in the permutation and is usually omitted.

The product or composition of two permutations α, β is denoted by αβ. The product αβ is defined as αβ(x) = α(β(x)) for x ∈ E. For instance, with E = {a, b, c, d, e, f}, α = (b d e) and β = (c a e b f d), we have αβ = (c a b f e d).

The identity permutation, which maps every element into itself, will be denoted by 1. Every permutation α has an inverse α^(−1) such that αα^(−1) = α^(−1)α = 1. For a cycle, the inverse is obtained by reversing the order of its elements: (c b a) is the inverse of (a b c). A 2-cycle decomposition of a permutation α is a representation of α as a product of 2-cycles, not necessarily disjoint. All permutations have a 2-cycle decomposition. The norm of a permutation α, denoted by ||α||, is the minimum number of cycles in a 2-cycle decomposition of α. For example, the permutation α = (a b c d) can be decomposed as (a b)(b c)(c d), and ||α|| = 3.
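The definitions above are easy to exercise in code. The sketch below is our illustration (the representation and function names are our choices): it stores a permutation as a dict, composes two permutations as αβ(x) = α(β(x)), and computes the norm as the sum of (k − 1) over the k-cycles, which equals the minimum number of 2-cycles in a decomposition.

```python
def compose(a, b):
    """Product ab, defined by (ab)(x) = a(b(x)); permutations as dicts."""
    keys = set(a) | set(b)
    return {x: a.get(b.get(x, x), b.get(x, x)) for x in keys}

def cycles(p):
    """Disjoint cycle decomposition; 1-cycles (fixed points) are omitted."""
    seen, out = set(), []
    for start in p:
        if start in seen or p[start] == start:
            seen.add(start)
            continue
        cyc, x = [], start
        while x not in seen:
            seen.add(x)
            cyc.append(x)
            x = p[x]
        out.append(tuple(cyc))
    return out

def norm(p):
    """||p|| = sum of (cycle length - 1) over the cycles of p."""
    return sum(len(c) - 1 for c in cycles(p))

alpha = {'b': 'd', 'd': 'e', 'e': 'b'}                                # (b d e)
beta = {'c': 'a', 'a': 'e', 'e': 'b', 'b': 'f', 'f': 'd', 'd': 'c'}   # (c a e b f d)
ab = compose(alpha, beta)   # the product (c a b f e d) from the text
```

Running `compose(alpha, beta)` reproduces the worked example αβ = (c a b f e d) above.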

2.3 The Matrix Median Problem

In this section, we show how to model genomes as matrices, and define a matrix distance that corresponds to the algebraic distance. Such a metric can be useful in the computation of genome medians, and we show here how to compute an approximate solution to the matrix median problem, by solving a system of linear equations.

2.3.1 From genomes to matrices

In this chapter, we represent genomes with a formulation similar to the set representation of a genome, used in several related works [41, 42, 78]. In this representation, each gene a has two extremities, called tail and head, respectively denoted by at and ah, or alternatively using signs, where −a = ah and +a = at. An adjacency is an unordered pair of extremities indicating a linkage between two consecutive genes in a chromosome. An extremity not adjacent to any other extremity in a genome is called a telomere. A genome is represented as a set of adjacencies and telomeres (the telomeres may be omitted when the gene set is given) where each extremity appears at most once. In this way, genomes can be seen as matchings, that is, sets of pairwise disjoint edges on the gene extremities.

We then model these genomes as matrices in a very simple way. Given a genome as a list of adjacencies, we model it as the corresponding adjacency matrix, with telomeres counting as self-loops, i.e., a matrix with coefficients

a_ij = 1, if ij corresponds to an adjacency, or if i = j and i is a telomere;
a_ij = 0, otherwise.

For example, a genome with the adjacencies {{ah, ct}, {at, bh}} can be modeled as the matrix

A = [ 0 0 0 0 0 1 ]
    [ 0 0 1 0 0 0 ]
    [ 0 1 0 0 0 0 ]
    [ 0 0 0 1 0 0 ]
    [ 0 0 0 0 1 0 ]
    [ 1 0 0 0 0 0 ].

The columns (and rows) of A represent the extremities ah, at, bh, bt, ch, and ct, respectively.

This mapping produces symmetric matrices A that are invertible, and that satisfy A^(−1) = A^T = A, where A^T denotes the transpose of matrix A. Also, their order is always even. Matrices with 0-1 coefficients that satisfy A^(−1) = A^T are called permutation matrices. They preserve vector norms.
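As a sanity check on this mapping, the sketch below (our illustration; the function names are our choices) builds the 6 × 6 matrix of the example above, and can be used to verify that the result is a symmetric involution, i.e., that A·A = I.

```python
def genome_matrix(extremities, adjacencies):
    """0-1 adjacency matrix of a genome; telomeres become self-loops."""
    idx = {e: i for i, e in enumerate(extremities)}
    n = len(extremities)
    M = [[0] * n for _ in range(n)]
    paired = set()
    for x, y in adjacencies:
        M[idx[x]][idx[y]] = M[idx[y]][idx[x]] = 1
        paired.update((x, y))
    for e in extremities:          # unmatched extremities are telomeres
        if e not in paired:
            M[idx[e]][idx[e]] = 1
    return M

def matmul(X, Y):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

ext = ['ah', 'at', 'bh', 'bt', 'ch', 'ct']
A = genome_matrix(ext, [('ah', 'ct'), ('at', 'bh')])
```

Here bt and ch receive self-loops, matching the matrix displayed above.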

2.3.2 Matrix Distance

Given two n × n matrices A and B, we define the distance between them as the rank distance [35]:

d(A, B) = r(B − A),

where r(X) denotes the rank of matrix X. This distance can be shown to satisfy the conditions of a metric, that is, it is symmetric, obeys the triangle inequality, and d(A, B) = 0 if and only if A = B. The hardest part is the triangle inequality, which we prove below, in Lemma 1.

It is well-known that

r(X) = dim im(X),    (2.1)

where dim denotes the dimension of a vector space and im(X) is the image of X, namely, the space of all vectors that can be written as Xv for some v ∈ R^n. Here we treat vectors as column matrices, that is, n × 1 matrices. Therefore, the distance satisfies

d(A, B) = dim im(B − A)

for every pair of matrices A and B. It is also well-known that

r(X) = n − dim ker(X),    (2.2)

where ker(X) is the kernel of X, namely, the space of all vectors v ∈ R^n such that Xv = 0, so the distance satisfies yet another formula:

d(A, B) = n − dim ker(B − A).

With this, we are in a position to show the triangle inequality.

Lemma 1. For any three n × n matrices A, B, and C, we have

d(A, C) ≤ d(A, B) + d(B, C).

Proof. It is easy to see that

im(C − A) ⊆ im(B − A) + im(C − B),

because any vector of the form (C − A)v is clearly equal to the sum (B − A)v + (C − B)v of a vector in im(B − A) with a vector in im(C − B).

Using Equation 2.1, we conclude that

d(A, C) = r(C − A)
        = dim im(C − A)
        ≤ dim(im(B − A) + im(C − B))
        ≤ dim im(B − A) + dim im(C − B)
        = r(B − A) + r(C − B)
        = d(A, B) + d(B, C).

2.3.3 Relationship to known distances

In this section, we contextualize our work, showing that the matrix distance is relevant in the genome context, because it is equivalent to the algebraic distance and very close to the DCJ distance, which has been used extensively in rearrangement theory. This section, however, is not necessary to understand the rest of the chapter.

Algebraic Adjacency Theory

According to algebraic rearrangement theory [42], a genome can be seen as a permutation π : E → E, where E is the set of gene extremities, with the added property that π^2 = 1, the identity permutation. In the algebraic theory, genomes are represented by permutations, with a genome being a product of 2-cycles and 1-cycles, with the 1-cycles being the telomeres, and each 2-cycle corresponding to an adjacency. Figure 2.1 shows an example of a genome and its representation as a permutation.

The distance between two genomes or two permutations π and σ will be defined as ||σπ^(−1)||. It is important to note that, in the original paper, the algebraic distance is defined as ||σπ^(−1)||/2. However, to avoid dealing with fractional numbers and to simplify the presentation, we drop the division by 2 here.


Figure 2.1: A genome with one linear chromosome, represented by the permutation π = (−1 −2)(+2 −3)(+3 +4)(−4 +5). Notice that +1 and −5 are telomeres, and therefore are fixed points of the permutation.

With these definitions, a distance between two genomes σ and π can be defined as ||σπ^(−1)||, as mentioned above. The resulting distance is very close to the DCJ distance (see next section). For circular genomes, the algebraic distance is exactly equal to twice the DCJ distance.

Given this metric, the first interesting observation is that permutations (including genomes) can be mapped to matrices in a distance-preserving way. Given a permutation α : E → E, with |E| = n, we first identify each element v of E with a unit vector of R^n, and then define A, the matrix counterpart of α, so that

Av = αv.    (2.3)

In Equation 2.3, we use v both as a unit vector of R^n on the left side, and as an element of E on the right side. We may extend the notation αv to an arbitrary vector v = Σ a_i v_i as follows:

αv = Σ a_i αv_i,

where the v_i are the unit vectors in the standard basis of R^n.

Also, the identity permutation corresponds to the identity matrix I, and the product αβ corresponds to the matrix AB, where A is the matrix corresponding to α, and B is the matrix corresponding to β. If α happens to be a genome, that is, if α^2 = 1, then A is a symmetric matrix, and vice-versa.

We now show that the mapping from permutations to matrices just defined is distance-preserving.

Lemma 2. For any permutations σ and π, and their respective associated matrices S and P, we have:

||σπ^(−1)|| = d(S, P).

Proof. First, notice that it suffices to show that

||α|| = r(A − I),    (2.4)

for any permutation α and associated matrix A. Indeed, since P is invertible, we have

r(S − P) = r(SP^(−1) − I),

and then Equation (2.4) relates ||σπ^(−1)|| to the distance between the corresponding matrices S and P.

Then proceed to show that, for a k-cycle, Equation (2.4) is true, since both sides are equal to k − 1. Finally, for a general permutation, decompose it into disjoint cycles, and use the fact that, for disjoint permutations α and β (that is, permutations whose sets of unfixed points are disjoint), with associated matrices A and B, respectively, we have

ker(A − I) ∩ ker(B − I) = ker(AB − I)

and

ker(A − I) + ker(B − I) = R^n,

which guarantee that

n − dim ker(AB − I) = (n − dim ker(A − I)) + (n − dim ker(B − I)),

or

r(AB − I) = r(A − I) + r(B − I),

because of Equation 2.2.

Therefore, if Equation (2.4) is valid for α and β, and if α and β are disjoint, since ||αβ|| = ||α|| + ||β|| for two disjoint permutations, the formula r(AB − I) = r(A − I) + r(B − I), valid for the corresponding matrices A and B, guarantees that Equation 2.4 is valid for the product αβ. Since any permutation can be written as a product of disjoint cycles, Equation (2.4) is valid in general.

Because the correspondence between permutations and matrices preserves distances, it makes sense to study the matrix median problem as a way of shedding light into the algebraic genome median problem.

DCJ distance

A genome can also be seen as a matching in a graph where gene extremities are the vertices and adjacencies are the edges connecting them. We can also take two genomes with the same gene content and draw them in a single graph, where the vertices are the gene extremities, and there is an edge between two vertices if they correspond to an adjacency in either genome. The graph may have parallel edges, and every vertex has degree 0, 1, or 2. This graph is called the breakpoint graph, and it is a key data structure for various studies on rearrangement [5].

Using the breakpoint graph, the DCJ distance [82] between two genomes π and σ can be expressed as

dDCJ(π, σ) = N − (C + Peven/2),

where N is the number of genes, C is the number of cycles in the breakpoint graph of π and σ, and Peven is the number of paths with an even number of edges.

We can also compute the algebraic distance using the breakpoint graph, and this approach gives us the formula

d(π, σ) = 2(N − (C + P/2)),

where P is the total number of paths in the breakpoint graph.


As these formulas suggest, and [42] already showed, the algebraic distance is closely related to the DCJ distance. This fact gives even more motivation to study new approaches to the problem, because the DCJ distance is widely used in the community.
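Both formulas can be evaluated by building the breakpoint graph explicitly. The sketch below is ours (the function name and the genome encoding as sets of adjacency pairs are our choices, and we read the formulas as dDCJ = N − (C + Peven/2) and d = 2(N − (C + P/2))): it traverses the connected components of the graph, classifying each as a cycle or a path.

```python
from collections import defaultdict

def breakpoint_distances(extremities, pi, sigma):
    """DCJ and algebraic distances from the breakpoint graph of pi, sigma.

    pi and sigma are sets of adjacencies (frozensets of two extremities)
    over the same even-sized extremity set; telomeres are simply absent."""
    edges = [tuple(a) for a in pi] + [tuple(a) for a in sigma]
    inc = defaultdict(list)                 # vertex -> incident edge ids
    for i, (x, y) in enumerate(edges):
        inc[x].append(i)
        inc[y].append(i)
    seen = set()
    cycles = paths_even = paths_total = 0
    for start in extremities:
        if start in seen:
            continue
        # explore the whole connected component containing `start`
        stack, comp_v, comp_e = [start], set(), set()
        while stack:
            v = stack.pop()
            if v in comp_v:
                continue
            comp_v.add(v)
            for i in inc[v]:
                comp_e.add(i)
                x, y = edges[i]
                stack.append(y if x == v else x)
        seen |= comp_v
        if len(comp_e) == len(comp_v):      # every vertex has degree 2
            cycles += 1
        else:                               # a path (possibly with 0 edges)
            paths_total += 1
            if len(comp_e) % 2 == 0:
                paths_even += 1
    n_genes = len(extremities) // 2
    d_dcj = n_genes - cycles - paths_even // 2
    d_alg = 2 * n_genes - 2 * cycles - paths_total
    return d_dcj, d_alg
```

For two identical genomes, every shared adjacency becomes a 2-cycle of parallel edges and every shared telomere an even path with 0 edges, so both distances come out as 0, as expected.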

2.3.4 Matrix Median Problem

We now study the median problem on more general matrices. Let A, B, and C be three n × n matrices. We want to find a matrix M such that

d(M; A, B, C) = d(M, A) + d(M, B) + d(M, C)

is minimized. In order to have a small d(M, A), there must be a large subspace E of R^n such that, for every vector v ∈ E, we have Mv = Av. In other words, A and M act in the same way on this subspace. This is equivalent to saying that ker(A − M) is large. Similarly for B and C.

This idea is supported by the following result:

Theorem 1. For any n × n matrices A, B, and C, there is a median M with the property that, for all v ∈ R^n such that Av = Bv = Cv, we have Mv = Av.

Proof. Let V1 = {v ∈ R^n | Av = Bv = Cv}. For any n × n matrix X (not necessarily a median), define

K(X) = {v ∈ R^n | Xv = Av = Bv = Cv}.

Now take a median M that maximizes dim K(M). We claim that M is the sought median, that is, K(M) = V1.

Obviously K(M) ⊆ V1. If K(M) ≠ V1, we have that K(M) is a proper subspace of V1. Because of this, we can take a non-null vector v ∈ V1 ∩ K(M)^⊥. Without loss of generality, take v unitary, that is, |v|^2 = v^T v = 1.

Consider the matrix M′ = M + (Av − Mv)v^T. We now want to determine the median score of M′. We have that

d(A, M′) = r(A − M′) = r(A − M − (Av − Mv)v^T).

We can simplify the right-hand expression by factoring A − M:

d(A, M′) = r((A − M)(I − vv^T)).

We can then use the property that r(XY) ≤ min(r(X), r(Y)), and get to a useful bound for d(A, M′):

d(A, M′) ≤ min(r(A − M), r(I − vv^T))
         ≤ r(A − M)
         = d(A, M),


and the same conclusion can be drawn for d(B, M′) and d(C, M′), leading to

d(M′; A, B, C) ≤ d(M; A, B, C),

which makes M′ also a median.

Now we proceed to show that the subspace K(M) is properly contained in K(M′). Let w be a vector of K(M). We have that

M′w = Mw + (Av − Mv)v^T w.

Since v is orthogonal to K(M), we know that v^T w = 0, and consequently M′w = Mw, showing that K(M) ⊆ K(M′). Moreover, the vector v belongs to K(M′):

M′v = (M + (Av − Mv)v^T)v
    = Mv + (Av − Mv)v^T v
    = Mv + (Av − Mv)
    = Av.

So, we found a new median M′ with dim K(M′) > dim K(M), contradicting the choice of M. Therefore, the assumption K(M) ≠ V1 must be false, which implies that M is a median that agrees with A, B, and C on V1.

Theorem 1 suggests the following strategy. Decompose R^n as a direct sum of five subspaces, V1, V2, V3, V4, and V5, where the following relations are true:

• Av = Bv = Cv for all v ∈ V1,
• Av = Bv ≠ Cv for all v ∈ V2,
• Av ≠ Bv = Cv for all v ∈ V3,
• Av = Cv ≠ Bv for all v ∈ V4, and
• Av ≠ Bv ≠ Cv ≠ Av for all v ∈ V5.

In the first subspace, since A, B, and C all have the same behaviour, M should also do the same thing. For v in the second subspace, since Av = Bv but Cv is different, it is better for M to go with A and B. Likewise, in the third subspace M should concur with B and C, and with A and C in the fourth. Finally, in the fifth subspace it seems hard to gain in two different distances at once, so the best course for M is to mimic one of A, B, or C.

Therefore, making M equal to A, except in the third subspace, where it should be equal to B (and C), should yield a reasonable approximation of a median, if not a median. The rest of this section will be devoted to showing the details of this construction.


2.3.5 Partitioning R^n

We begin by introducing notation aimed at formalizing the subspaces V1 through V5 mentioned in the last section (Section 2.3.4). Given n × n matrices A, B, and C, we will use a dotted notation to indicate a partition; e.g., .AB.C. means a partition with two classes, where A and B are in one class, and C is in the other class by itself. To each such partition, we associate the vector subspace of R^n formed by those vectors having the same image under each class:

V(.AB.C.) = {v ∈ R^n | Av = Bv}.

Notice that singleton classes do not impose additional restrictions. Notice also that V(.A.B.C.) = R^n. With this notation, subspace V1 can be written as V(.ABC.).

We also need a notation for subspaces such as V2, where distinct classes actually disagree, that is, subspaces where vectors have different images under each class. There can be more than one subspace satisfying this property, but we can use orthogonality to define a unique one. For a partition p, we define V*(p) as the orthogonal complement, with respect to V(p), of the sum of the subspaces given by the partitions strictly refined by p:

V*(p) = V(p) ∩ (Σ_{p<q} V(q))^⊥,

where p < q means that partition p strictly refines partition q. In other words, we want to capture the part of the subspace V(p) that is orthogonal to the sum of the subspaces corresponding to coarser partitions. Notice that p < q implies V(q) ⊆ V(p).

With three matrices A, B, and C, we have that .A.B.C. strictly refines .AB.C., .BC.A., and .AC.B., while these three partitions strictly refine .ABC.. The five V* subspaces defined by these partitions are illustrated in Figure 2.2. For example, let

A = [ 1 0 0 0 ]    B = [ 1 0 0 0 ]    C = [ 0 0 1 0 ]
    [ 0 1 0 0 ]        [ 0 1 0 0 ]        [ 0 0 0 1 ]
    [ 0 0 1 0 ]        [ 0 0 0 1 ]        [ 1 0 0 0 ]
    [ 0 0 0 1 ],       [ 0 0 1 0 ],       [ 0 1 0 0 ].

For partition .ABC., we have

V*(.ABC.) = V(.ABC.) = ⟨[1 1 1 1]^T⟩.

Moving on, we solve (A − B)x = 0 to compute the next subspace,

V(.AB.C.) = ⟨[1 0 0 0]^T, [0 1 0 0]^T, [0 0 1 1]^T⟩,

and, since we already computed V(.ABC.), its strict version is

V*(.AB.C.) = V(.AB.C.) ∩ V(.ABC.)^⊥ = ⟨[1 1 −1 −1]^T, [−1 1 0 0]^T⟩.


Figure 2.2: The partitioning of R^n into five V* subspaces.

For the parts .AC.B. and .BC.A., we follow a similar process:

V(.AC.B.) = ⟨[1 0 1 0]^T, [0 1 0 1]^T⟩,
V*(.AC.B.) = V(.AC.B.) ∩ V(.ABC.)^⊥ = ⟨[1 −1 1 −1]^T⟩,

and

V(.BC.A.) = ⟨[1 1 1 1]^T⟩,
V*(.BC.A.) = V(.BC.A.) ∩ V(.ABC.)^⊥ = {0}.

Finally, as we already saw, V(.A.B.C.) = R^n. Because the previous subspaces already total four dimensions, we have that

V*(.A.B.C.) = V(.A.B.C.) ∩ (V(.AB.C.) + V(.AC.B.) + V(.BC.A.))^⊥ = {0}.

It is easy to see that the V* subspaces are pairwise disjoint, but this is not enough to prove that their direct sum is R^n. So, the proof will start from the basic sum V*(.ABC.) ⊕ V*(.AB.C.), where we already know that V*(.ABC.) ∩ V*(.AB.C.) = {0}, and it will add one subspace to the sum at a time, ensuring that the intersection between the new subspace and the sum of the subspaces previously included contains the zero vector only.

Lemma 3. If A, B, and C are arbitrary square matrices, then

(V*(.ABC.) + V*(.AB.C.)) ∩ V*(.BC.A.) = {0}.


Proof. Notice that

V*(.ABC.) + V*(.AB.C.) = V(.AB.C.).

But, by definition,

V(.AB.C.) ∩ V*(.BC.A.) = {0}.

Before we add more subspaces to the sum, we will need the result of the following lemma.

Lemma 4. If A, B, and C are permutation matrices, and if

2Bv = Av + Cv

for a given vector v, then Av = Cv.

Proof. Denote by |x| the norm of a vector x. Note that A, B, and C preserve norms; thus

|Av + Cv| = |2Bv| = 2|v| = |v| + |v| = |Av| + |Cv|.

But, if the norm of the sum is equal to the sum of the norms, then the two vectors are parallel, and have the same orientation. In other words, there is a positive scalar c such that cAv = Cv. But we already have that |Av| = |v| = |Cv|. Therefore, c = 1.

Lemma 5. If A, B, and C are permutation matrices, then

(V*(.ABC.) + V*(.AB.C.) + V*(.BC.A.)) ∩ V*(.AC.B.) = {0}.

Proof. Suppose that u + v + w ∈ V*(.AC.B.), where u ∈ V*(.ABC.), v ∈ V*(.AB.C.), and w ∈ V*(.BC.A.). We have that A(u + v + w) = C(u + v + w), which implies A(v + w) = C(v + w), since Au = Cu. Thus, using Av = Bv and Bw = Cw, we have

Av + Aw = Cv + Cw
Bv + Aw = Cv + Bw
B(v − w) = Cv − Aw.

Now we apply A and C to v − w, and sum the results, obtaining:

A(v − w) = Bv − Aw
C(v − w) = Cv − Bw
A(v − w) + C(v − w) = B(v − w) + Cv − Aw = 2B(v − w).

By Lemma 4, we conclude that A(v − w) = C(v − w). But we also have that A(v + w) = C(v + w), which implies Av = Cv and Aw = Cw. In other words, u + v + w ∈ V*(.ABC.), and, since V*(.ABC.) ∩ V*(.AC.B.) = {0} by definition, it follows that u + v + w = 0.


Lemma 6. If A, B, and C are permutation matrices, then

(V*(.ABC.) + V*(.AB.C.) + V*(.BC.A.) + V*(.AC.B.)) ∩ V*(.A.B.C.) = {0}.

Proof. If a vector v is in the set on the left-hand side, then it should be in

V*(.A.B.C.) = V(.A.B.C.) ∩ (Σ_{.A.B.C.<q} V(q))^⊥ ⊆ (Σ_{.A.B.C.<q} V(q))^⊥.

On the other hand,

v ∈ Σ_{.A.B.C.<q} V*(q) ⊆ Σ_{.A.B.C.<q} V(q).

It follows that v = 0.

With Lemmas 3 through 6, we get to the following theorem:

Theorem 2. If A, B, and C are permutation matrices, then

R^n = V*(.ABC.) ⊕ V*(.AB.C.) ⊕ V*(.BC.A.) ⊕ V*(.AC.B.) ⊕ V*(.A.B.C.).

It is important to observe that Theorem 2 does not apply to general matrices; for instance:

A = [ 0 0 ]    B = [ 0 1 ]    C = [ 1 0 ]
    [ 0 0 ],       [ 0 1 ],       [ 1 0 ].

With these three matrices, we have

V*(.ABC.) = {0},
V*(.AB.C.) = ⟨[1 0]^T⟩,
V*(.BC.A.) = ⟨[1 1]^T⟩,
V*(.AC.B.) = ⟨[0 1]^T⟩,

where ⟨X⟩ denotes the space spanned by the set X. We can see that, in this case,

(V*(.ABC.) + V*(.AB.C.) + V*(.BC.A.)) ∩ V*(.AC.B.) ≠ {0}.

2.3.6 Computing Median Candidates

We now get into further detail on the procedure to compute these matrices. We saw that, when A, B, and C are permutation matrices, R^n can be decomposed into a direct sum of V* subspaces. We will now implement the procedure outlined in Section 2.3.4, summarized in Table 2.1.


Table 2.1: Distance contribution. Given three permutation matrices A, B, and C, this table shows the distance contribution of each of the five subspaces partitioning R^n to the distances d(MA, A), d(MA, B), and d(MA, C), for a candidate median matrix MA.

                               Contributes to ...
Subspace        MA = ...   d(M, A)   d(M, B)   d(M, C)
V*(.A.B.C.)     A          no        yes       yes
V*(.AB.C.)      A          no        no        yes
V*(.BC.A.)      B          yes       no        no
V*(.AC.B.)      A          no        yes       no
V*(.ABC.)       A          no        no        no

One way to implement this strategy is to compute projection matrices P1, P2, P3, P4, and P5 for each of the subspaces and then compute MA as follows:

MA = AP1 + AP2 + BP3 + AP4 + AP5
   = AP1 + AP2 + BP3 + AP4 + AP5 + AP3 − AP3
   = A(P1 + P2 + P3 + P4 + P5) + BP3 − AP3
   = A + (B − A)P3,

since P1 + P2 + P3 + P4 + P5 = I. To obtain the matrices Pi, all we need are the n × ki matrices Si whose columns form an orthonormal basis of the corresponding subspace.

To build the Si bases, one possibility is to use Function Add below, a basic routine to expand an orthonormal basis so that it can also generate a given extra vector v. It projects the new vector onto the orthogonal complement of the subspace generated by the original basis and then adds the normalized projection to form the new basis. Function Add's complexity is O(kn) arithmetic operations (additions, subtractions, multiplications, and divisions).

Function Add(S, v): conditionally expands an orthonormal basis S so it also generates the vector v.
Data: An orthonormal basis S = {v1, ..., vk} and a vector v.
Result: A conditionally augmented basis.
1  w ← v − Σ_{i=1}^{k} (v_i^T v) v_i
2  if w = 0 then
3      return S
4  else
5      normalize w
6      return S ∪ {w}

The returned basis S′ has several important properties:

1. S′ is an orthonormal set.

Proof. The set S′ either equals S, which is an orthonormal set, or equals S ∪ {w}. In the second case, w is orthogonal to every vector in S by construction, and is also normalized, making S′ an orthonormal set.

2. S′ ⊇ S.

3. S′ generates ⟨⟨S⟩ ∪ {v}⟩.

Proof. Because S′ ⊇ S, we have that S′ generates ⟨S⟩. We then only need to show that it generates v. From the algorithm, we have that v = w + Σ_{i=1}^{k} (v_i^T v) v_i. Because (v_i^T v) is a scalar, we can conclude that S′ also generates v.
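Function Add translates almost line by line into code. The floating-point sketch below is ours: it replaces the exact test w = 0 with a small tolerance, which is the usual compromise when working with real arithmetic.

```python
import math

def add(S, v, tol=1e-9):
    """Conditionally expand the orthonormal basis S (a list of vectors)
    so that it also generates v, as in Function Add."""
    # w <- v minus its projection onto span(S); S is orthonormal,
    # so the projection is the sum of <v_i, v> v_i over the basis.
    w = list(v)
    for u in S:
        c = sum(ui * vi for ui, vi in zip(u, v))   # <u, v>
        w = [wi - c * ui for wi, ui in zip(w, u)]
    nrm = math.sqrt(sum(x * x for x in w))
    if nrm < tol:              # v already lies in span(S): S unchanged
        return S
    return S + [[x / nrm for x in w]]              # append normalized w

S = add([], [3.0, 0.0, 0.0])
S = add(S, [1.0, 1.0, 0.0])
S = add(S, [2.0, 5.0, 0.0])    # dependent vector: basis does not grow
```

After the three calls, S holds two orthonormal vectors spanning the xy-plane of R^3, and the dependent third vector is rejected.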

Algorithm 1 uses Function Add to determine orthonormal bases for each of the five V* subspaces. Also, we define an auxiliary method null(A) that takes a matrix A and returns, also as a matrix, an orthonormal basis for the null space of A.

Algorithm 1: Computation of orthonormal bases for each subspace V*.
Data: Three n × n permutation matrices A, B, and C.
Result: Subspace bases S1, S2, S3, S4, S5.
1   S ← ∅
2   L ← null([A − B; A − C])    (the two differences stacked vertically)
3   foreach v ∈ L do
4       S ← Add(S, v)
5   S1 ← S                      // basis of V*(.ABC.)
6   L ← null(A − B)
7   foreach v ∈ L do
8       S ← Add(S, v)
9   S2 ← S − S1                 // basis of V*(.AB.C.)
10  S ← S1
11  L ← null(B − C)
12  foreach v ∈ L do
13      S ← Add(S, v)
14  S3 ← S − S1                 // basis of V*(.BC.A.)
15  S ← S1
16  L ← null(C − A)
17  foreach v ∈ L do
18      S ← Add(S, v)
19  S4 ← S − S1                 // basis of V*(.AC.B.)
20  L ← null([S1 S2 S3 S4]^T)
21  S5 ← ∅                      // basis of V*(.A.B.C.)
22  foreach v ∈ L do
23      S5 ← Add(S5, v)


Once we have determined orthonormal bases for the V* subspaces, we then proceed to find projection matrices for them. For every subspace Vi, it is necessary to compute the projection onto ⟨Si⟩ along ⟨∪_{j≠i} Sj⟩. For S1 and S5, the projection matrices are simply P1 = S1 S1^T and P5 = S5 S5^T, respectively, because they are both orthonormal bases and represent subspaces that are orthogonal to ∪_{j≠i} Sj. Notice that S1 has at least one column, corresponding to the vector [1 1 ... 1]^T.

For S2, S3, and S4, the projection is a bit more complex. Because they are not necessarily orthogonal to all other Sj, we cannot just use their orthogonal projections. Instead, for these subspaces Si, it is necessary to use non-orthogonal projections. To compute a projection Pi, we need the basis matrix Si and a matrix Ni that represents an orthonormal basis for ⟨∪_{j≠i} Sj⟩^⊥. For example, N2 would be an orthonormal basis for ⟨S1 ∪ S3 ∪ S4 ∪ S5⟩^⊥. With Si and Ni, we can compute

Pi = Si (Ni^T Si)^(−1) Ni^T

[65, Sec. 7.10, p. 634].

To show that this Pi is the sought projection (i = 2, 3, 4), we need to ensure that, for every v ∈ R^n, when we decompose v = v1 + v2 + v3 + v4 + v5 with vj ∈ ⟨Sj⟩ for j = 1, 2, 3, 4, 5, we have Pi v = vi. Let us verify that this is true in the case i = 2. The other cases are analogous.

Take a vector v uniquely expressed as v = v1 + v2 + v3 + v4 + v5 with vj ∈ ⟨Sj⟩ for j = 1, 2, 3, 4, 5. Its projection onto ⟨S2⟩ according to P2 is

P2 v = S2 (N2^T S2)^(−1) N2^T (v1 + v2 + v3 + v4 + v5).

Since the columns of N2 are orthogonal to ⟨∪_{j≠2} Sj⟩, we have that N2^T vj = 0 for every j ≠ 2, leaving us with

P2 v = S2 (N2^T S2)^(−1) N2^T v2.

As v2 is in the image of S2, it can be written as S2 u2, and therefore

P2 v = S2 (N2^T S2)^(−1) N2^T S2 u2
     = S2 u2
     = v2.

Note that, in the computation of P2, P3, and P4, it is not possible to simplify the right-hand expression Si (Ni^T Si)^(−1) Ni^T. In general, neither Si nor Ni is a square matrix, and therefore they are not invertible. Nevertheless, the product Ni^T Si is always invertible.

Knowing the projection matrices, we can finally compute the median candidates, in Algorithm 2. Previously, we saw how to compute MA. It is also possible to define MB and MC in an analogous way: the matrix MB follows B in V*(.A.B.C.) instead of A, and MC follows C. The entire computation takes O(n^3) arithmetic operations.

Algorithm 2: Computation of median candidates.
Data: Three n × n permutation matrices A, B, and C.
Result: Three median candidates MA, MB, and MC.
1   Compute the bases S1, S2, S3, S4, and S5 with Algorithm 1
2   P1 ← S1 S1^T
3   N2 ← null([S1 S3 S4 S5]^T)
4   P2 ← S2 (N2^T S2)^(−1) N2^T
5   N3 ← null([S1 S2 S4 S5]^T)
6   P3 ← S3 (N3^T S3)^(−1) N3^T
7   N4 ← null([S1 S2 S3 S5]^T)
8   P4 ← S4 (N4^T S4)^(−1) N4^T
9   P5 ← S5 S5^T
10  MA ← A + (B − A)P3
11  MB ← B + (A − B)P4
12  MC ← C + (B − C)P2

2.3.7 Approximation Factor

We already know that the direct sum of the V* subspaces is R^n when A, B, and C are permutation matrices, so the total score of a candidate can be expressed in terms of the dimensions of these subspaces. Again, in this section, we will use the matrix MA in the computations, but the results are also valid for MB and MC. The matrix MA will have a total distance to A, B, and C equal to:

d(MA; A, B, C) = d(MA, A) + d(MA, B) + d(MA, C)
              = dim V*(.BC.A.)
                + (dim V*(.A.B.C.) + dim V*(.AC.B.))
                + (dim V*(.A.B.C.) + dim V*(.AB.C.))
              = 2 dim V*(.A.B.C.) + dim V*(.AB.C.) + dim V*(.AC.B.) + dim V*(.BC.A.).    (2.5)
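Since each subspace dimension is a nullity of a (possibly stacked) difference matrix, Equation (2.5) can be checked numerically. The numpy sketch below is ours (variable names are our choices): it computes the subspace dimensions for the 4 × 4 example of Section 2.3.5, where dim V*(.A.B.C.) = 0, so that P3 = 0, MA = A, and the score from Equation (2.5) can be compared against the directly computed ranks.

```python
import numpy as np

A = np.eye(4)
B = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], float)
C = np.array([[0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]], float)
n = 4

def nullity(M):
    """dim ker(M) = number of columns minus rank (Equation 2.2)."""
    return M.shape[1] - np.linalg.matrix_rank(M)

dim_v1 = nullity(np.vstack([A - B, A - C]))        # dim V*(.ABC.)
dim_v2 = nullity(A - B) - dim_v1                   # dim V*(.AB.C.)
dim_v3 = nullity(B - C) - dim_v1                   # dim V*(.BC.A.)
dim_v4 = nullity(C - A) - dim_v1                   # dim V*(.AC.B.)
dim_v5 = n - (dim_v1 + dim_v2 + dim_v3 + dim_v4)   # dim V*(.A.B.C.)

score_MA = 2 * dim_v5 + dim_v2 + dim_v3 + dim_v4   # Equation (2.5)
# Here dim_v3 = 0, so P3 = 0 and MA = A; check the score directly:
direct = (np.linalg.matrix_rank(A - A) + np.linalg.matrix_rank(B - A)
          + np.linalg.matrix_rank(C - A))
```

For this instance both computations give 3, and since dim V*(.A.B.C.) = 0 the candidate is in fact an exact median (this is Theorem 3 below).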

There are cases where MA is not a median, even if the input matrices are genomic. Take, for example:

A = [ 0 1 0 0 ]    B = [ 1 0 0 0 ]    C = [ 0 0 1 0 ]
    [ 1 0 0 0 ]        [ 0 0 1 0 ]        [ 0 1 0 0 ]
    [ 0 0 1 0 ]        [ 0 1 0 0 ]        [ 1 0 0 0 ]
    [ 0 0 0 1 ],       [ 0 0 0 1 ],       [ 0 0 0 1 ].

By Equation (2.5), the matrix MA has the following total score:

d(MA; A, B, C) = 2 × 2 + 0 + 0 + 0 = 4.

However, the identity matrix I has a better total score than MA, and is actually a median in this case:

d(I; A, B, C) = r(A − I) + r(B − I) + r(C − I) = 1 + 1 + 1 = 3.

Thus, given that the procedure described in Section 2.3.6 does not guarantee a matrix median, it is interesting to know whether it is an approximation algorithm, namely, whether there is a constant ρ such that the candidate's total score is at most ρ times the score of a median.

In general, consider permutation matrices A, B, and C, and let M be a matrix such that d(M; A, B, C) is minimum, that is, M is a median. There is a trivial lower bound for the median score of a matrix, easily obtained with the help of the triangle inequality, namely

d(M; A, B, C) ≥ (1/2)(d(A, B) + d(B, C) + d(C, A)).

According to Equation (2.5), the median score of the approximate solution MA constructed in Section 2.3.6 is given by:

d(MA; A, B, C) = 2 dim V*(.A.B.C.) + dim V*(.AB.C.) + dim V*(.AC.B.) + dim V*(.BC.A.).

For comparison, we can write the trivial lower bound in terms of subspace dimensions. It suces to write each distance as a dimension sum of the subspaces where they dier. The result is:

\[ \frac{1}{2}\big(d(A, B) + d(B, C) + d(C, A)\big) = \frac{3}{2} \dim V^*(.A.B.C.) + \dim V^*(.AB.C.) + \dim V^*(.AC.B.) + \dim V^*(.A.BC.). \tag{2.6} \]

Then, to prove that the matrix MA is indeed an approximate solution, it suffices to show that there is a constant ρ such that

\[ d(M_A; A, B, C) \le \rho \, d(M; A, B, C), \]

for any given matrices A, B, and C. It is possible to demonstrate that 4/3 is an approximation factor for our solution, as follows:

\begin{align*}
d(M_A; A, B, C) &= 2 \dim V^*(.A.B.C.) + \dim V^*(.AB.C.) + \dim V^*(.AC.B.) + \dim V^*(.A.BC.) \\
&\le \frac{4}{3} \left( \frac{3}{2} \dim V^*(.A.B.C.) + \dim V^*(.AB.C.) + \dim V^*(.AC.B.) + \dim V^*(.A.BC.) \right) \\
&= \frac{4}{3} \cdot \frac{1}{2}\big(d(A, B) + d(B, C) + d(C, A)\big) \\
&\le \frac{4}{3}\, d(M; A, B, C).
\end{align*}


An analogous result holds for MB and MC.

The constant 4/3 in the approximation factor is tight, as witnessed by the example on page 34. In spite of that, the method returns an exact median when the subspace V5 has dimension zero.

Theorem 3. If dim V∗(.A.B.C.) = 0, then MA is a median.

Proof. According to Equation (2.5), if dim V5 = dim V∗(.A.B.C.) = 0, the candidate has score

\[ d(M_A; A, B, C) = \dim V^*(.AB.C.) + \dim V^*(.AC.B.) + \dim V^*(.A.BC.). \]

On the other hand, Equation (2.6) gives us the following lower bound for a median M, if dim V∗(.A.B.C.) = 0:

\[ d(M; A, B, C) \ge \dim V^*(.AB.C.) + \dim V^*(.AC.B.) + \dim V^*(.A.BC.). \]

Since d(MA; A, B, C) equals the lower bound in this case, MA is a median.

2.4 Matrix to genome heuristic

As we saw, the median candidates MA, MB, and MC have guarantees about their score, but are not necessarily genomes. In most applications, however, actual genomes are desirable. In this section, we show a simple heuristic to get a genome matrix from an arbitrary square matrix.

This heuristic is based on the fact that a genome is merely a matching of the gene extremities. From this point of view, an edge in the matching corresponds to an adjacency, and an unmatched vertex is a telomere.

Let M be an arbitrary square matrix. Construct a weighted graph GM = (VM, EM, w) associated to this matrix as follows. Its vertex set VM has a vertex for every row/column of M. To build EM, we start with an empty set and add an edge between two vertices i and j if and only if |Mij| + |Mji| > 0, with weight w(i, j) = |Mij| + |Mji|. Note that it is not guaranteed that the entries in the matrix M are positive, hence the absolute values.

Once the graph GM is built, we compute a maximum weight matching HM of GM. A matching is a subgraph in which no vertex has more than one incident edge. Consequently, it corresponds to a genome, where no extremity takes part in more than one adjacency.
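The steps above can be sketched as follows. The function name is ours, and the maximum-weight matching is found by brute force here, which is only feasible for small matrices; a real implementation would use a polynomial-time algorithm such as Edmonds' Blossom algorithm:

```python
def genome_matching(M):
    """Build the weighted graph G_M of Section 2.4 and return a
    maximum-weight matching of it as (total weight, list of edges)."""
    n = len(M)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            w = abs(M[i][j]) + abs(M[j][i])  # entries of M may be negative
            if w > 0:
                edges.append((w, i, j))

    def best(avail, rest):
        # Best matching using only edges in `rest` and vertices in `avail`.
        best_w, best_m = 0, []
        for k, (w, i, j) in enumerate(rest):
            if i in avail and j in avail:
                sub_w, sub_m = best(avail - {i, j}, rest[k + 1:])
                if w + sub_w > best_w:
                    best_w, best_m = w + sub_w, [(i, j)] + sub_m
        return best_w, best_m

    return best(frozenset(range(n)), edges)
```

In the resulting matching, each edge is read as an adjacency between two gene extremities, and each unmatched vertex as a telomere. For example, for M = [[0, 3, 1, 0], [2, 0, 0, 0], [0, 0, 0, 2], [0, 0, 1, 0]] the matching is [(0, 1), (2, 3)] with total weight 8.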

2.5 Experimental results

In this section, we discuss our implementation of the algorithms. A critical issue is how we dealt with rounding errors due to the numerical methods. We also show the results of our experiments.

We ran two types of tests. First, tests aiming at validating our algorithm and implementation. These tests used both random permutations and real genomic data. Then,
