Scaffolding algorithm using multiple reference genomes: a case study of the Rhizobium ecuadorense CNPSo 671T


DIRECTORATE OF RESEARCH AND GRADUATE STUDIES

GRADUATE PROGRAM IN BIOINFORMATICS (PPGBIOINFO)

HUGO MAURICIO PEÑA MERCADO

SCAFFOLDING ALGORITHM USING MULTIPLE REFERENCE GENOMES: A CASE STUDY OF THE RHIZOBIUM ECUADORENSE CNPSO 671T

MASTER’S DISSERTATION

CORNÉLIO PROCÓPIO 2019


Master’s dissertation presented to the Graduate Program in Bioinformatics of the Universidade Tecnológica Federal do Paraná – UTFPR in partial fulfillment of the requirements for the degree of “Master in Bioinformatics”. During the development of this work the author received financial support from CNPq.

Advisor: Prof. Dr. André Yoshiaki Kashiwabara

Co-advisor: Dr. Mariangela Hungria da Cunha


Scaffolding algorithm using multiple reference genomes: a case study of the Rhizobium ecuadorense CNPSo 671T / Hugo Mauricio Peña Mercado. – 2020.

103 leaves : color ill. ; 31 cm.

Advisor: André Yoshiaki Kashiwabara. Co-advisor: Mariangela Hungria da Cunha.

Dissertation (Master’s) – Universidade Tecnológica Federal do Paraná. Graduate Program in Bioinformatics. Cornélio Procópio, 2020.

Bibliography: p. 97-102.

1. Genome. 2. Nitrogen - Fixation. 3. Plasmids. 4. Bioinformatics – Dissertations. I. Kashiwabara, André Yoshiaki, advisor. II. Cunha, Mariangela Hungria da, co-advisor. III. Universidade Tecnológica Federal do Paraná. Graduate Program in Bioinformatics. IV. Title.

CDD (22nd ed.) 572.80285

UTFPR Library - Cornélio Procópio Campus

Librarian/Documentalist in charge: Romeu Righetti de Araujo – CRB-9/1676


Graduate Program in Bioinformatics

Dissertation Title No. 12:

“Scaffolding Algorithm using multiple reference genomes. A case study of the Rhizobium ecuadorense CNPSo 671T”.

by

Hugo Mauricio Peña Mercado

Advisor: Prof. Dr. André Yoshiaki Kashiwabara

Co-advisor: Dr. Mariangela Hungria da Cunha

This dissertation was presented in partial fulfillment of the requirements for the degree of MASTER IN BIOINFORMATICS – Research Line: Genetics and Genomics, by the Graduate Program in Bioinformatics – PPGBIOINFO – of the Universidade Tecnológica Federal do Paraná – UTFPR – Cornélio Procópio Campus, at 5:30 p.m. on December 12, 2019. The work was __________ by the Examining Committee, composed of the following professors:

Prof. Dr. André Yoshiaki Kashiwabara (Chair)

Prof. Dr. Fabrício Martins Lopes (UTFPR-CP)

Prof. Dr. Alan Mitchell Durham (USP-SP)

Approved by the program coordination: André Yoshiaki Kashiwabara, Coordinator of the Graduate Program in Bioinformatics, UTFPR Cornélio Procópio Campus

The signed Approval Sheet is held at the Program Coordination office.

Av. Alberto Carazzai, 1640 - 86.300-000 - Cornélio Procópio – PR.


First, I thank my parents: without their support and unconditional love, no step of my life would have been possible. I keep you always in my heart. I feel immense admiration and respect for you.

Next, I thank all the professors of the master’s program, who shared their knowledge with me.

I would especially like to thank my advisor, Dr. André Yoshiaki Kashiwabara, for all the patience he had with me; it was not easy, man, but we built a friendship that I hope continues beyond the master’s.

To my co-advisor, Dr. Mariângela Hungria, who made my research possible.

I thank the Secretary, José: you are the man; you were always there, ready to help. I wish you success in your life.

A special thanks to Professor Fabrício; his classes taught me a lot, and I admire them.

I thank the friends who were there for me, even at a distance.

I thank Capes, INCT, INCT MPCP-Agro, and UTFPR-CP, government programs that supported this research.

Finally, I thank the people who did me good throughout this stage, and also those who did not; everyone has something to teach us.


MERCADO, Hugo Mauricio Peña. SCAFFOLDING ALGORITHM USING MULTIPLE REFERENCE GENOMES: A CASE STUDY OF THE RHIZOBIUM ECUADORENSE CNPSO 671T. 103 leaves. Master’s Dissertation – Graduate Program in Bioinformatics (PPG-BIOINFO), Universidade Tecnológica Federal do Paraná. Cornélio Procópio, 2019.

The long-term consequences of using artificial fertilizers have begun to be noticed. In addition, the relationships between plants and microorganisms in the soil (such as fungi -Mycorrhiza- and bacteria -Rhizobacteria-) have become the subject of several studies concerned with feeding 9.8 billion people worldwide. One approach to studying these microorganisms in greater depth is DNA sequencing. However, sequencing technologies generate short sequences, which poses a computationally challenging problem due to the presence of repeats and non-uniform coverage. This work presents an algorithm for the scaffolding problem that uses multiple reference genomes, tries to avoid assembly errors (misassemblies), and provides both putative chromosomes and putative plasmids. Although some algorithms for the scaffolding problem exist, none was found that accepts genome assemblies in the contig stage as references, even though these assemblies contain useful information. Moreover, these algorithms assemble only a single scaffold and neglect the possibility of introducing misassemblies caused by the use of graphs and heuristics. The proposed algorithm offers, as an alternative, a more advanced analysis of the genomes and the possibility of customizing the output according to specific needs. The algorithm is expected to help identify symbiotic plasmids within genomes by finding possible homologs in the reference genomes. Finally, a future generalization of the scaffolding algorithm could be used not only for prokaryotes but also for large eukaryotic genomes.


MERCADO, Hugo Mauricio Peña. SCAFFOLDING ALGORITHM USING MULTIPLE REFERENCE GENOMES: A CASE STUDY OF THE RHIZOBIUM ECUADORENSE CNPSO 671T. 103 leaves. Master’s Dissertation – Graduate Program in Bioinformatics (PPG-BIOINFO), Universidade Tecnológica Federal do Paraná. Cornélio Procópio, 2019.

Recently, we have started to realize the long-term consequences of artificial fertilizers. In addition, understanding the relationships between plants and micro-organisms in the soil (such as fungi -Mycorrhiza- and bacteria -Rhizobacteria-) has become the center of numerous studies that look toward feeding a world of 9.8 billion people¹. One approach to studying those organisms further is to sequence their DNA. However, when sequencing technologies only allow us to generate short reads, assembly becomes a challenging computational problem (due to the presence of repeated sequences and non-uniform coverage). Here we present a scaffolding algorithm that uses multiple reference genomes, can discriminate misassemblies, and generates putative plasmids and chromosomes. Although there are many scaffolding algorithms already², we found that none of them takes as input genomes in the contig stage, even though these genomes might also contain useful information. Furthermore, these scaffolders only take care of the assembly of scaffolds and neglect the possible introduction of misassemblies due to the use of graphs and heuristics. Our algorithm offers an alternative for more advanced analysis of genomes, and the possibility to customize the output scaffolds according to specific needs. We hope our algorithm can help identify symbiotic plasmids within genomes by finding homologs in reference genomes. Moreover, scaffolding could be generalized not only to prokaryotes but also to larger genomes such as those of eukaryotes.

Keywords: Genome assembly. Nitrogen fixation. Plasmid. Scaffolding.

¹ The current world population is 7.6 billion; however, according to the United Nations report (RAFTERY et al., 2012), it is expected to reach 8.6 billion in 2030, 9.8 billion in 2050 and 11.2 billion in 2100.


CHAPTER 1

CHAPTER 2

FIGURE 1 – The Rhizosphere. Proposed model of interactions in the rhizosphere and in the bulk soil. Note that in the rhizosphere, products of the rhizodeposition stimulate microbial activity. Therefore, they influence the chemical balance of soil (N mineralization and immobilization) (TOBERGTE; CURTIS, 2013). From Tobergte et al.: “The proportion of total plant production allocated below ground, and the architecture of the root system depend on the distribution and availability of nutrients in soils”.

FIGURE 2 – Genome assembly pipeline. Divided into four stages: data pre-process, contig assembly, scaffold assembly, and post-processing.

FIGURE 3 – Contig assembly. From Myers: “The assembly problem is to reconstruct as much of a genome as possible given a collection of reads or read pairs”.

FIGURE 4 – Contig assembly software algorithms. Categorized by their scheme (Overlap-Layout-Consensus, Alignment-Layout-Consensus, greedy approach, graph-based, and Eulerian).

FIGURE 5 – SPAdes pipeline.

FIGURE 6 – K-mers. K-mers are substrings of length k. In this figure, the genome S is divided into one 4-mer (ATTC) and all of its possible 3-mers.

FIGURE 7 – From BayesHammer: Reads correction. Grey k-mers indicate non-solid k-mers. Red k-mers are the centers of the corresponding clusters (two grey k-mers struck through on the right are non-solid singletons). As a result, one nucleotide is changed.

FIGURE 8 – DBG errors. From SPAdes: “Selected features within a de Bruijn graph. The red h-path, P, is under consideration for deletion or projection to another path (bulge corremoval). The blue path(s), Q, are alternative paths. (A) A potential bulge. Q may contain hubs within it, though P does not. (B) A potential tip; h-path P starts or ends at a vertex of total degree 1, and there is an alternative h-path Q. (C) A potential chimeric h-path. There must be alternative h-paths Q1, Q2 both for the entrance and the exit to P. (D) h-path is a repeat. Note that P starts with a vertex of outdegree one and ends with a vertex of in-degree one and has no alternative h-path. These degree conditions differentiate it from (A, B, C).”


FIGURE 10 – Paired DBG. From SPAdes: “Vertices correspond to pairs of k-mers. The major problem is the estimation of the distance between paired reads”.

FIGURE 11 – Closing gaps. From Bankevich et al.: “Gaps in coverage lead to discontinuities in the de Bruijn graph, sometimes they can be closed if matching paired information is available”.

FIGURE 12 – From Green et al.: “Sequence assembly in whole-genome shotgun sequencing. Individual sequence reads are assembled into sequence contigs. Groups of sequence contigs are then organized into scaffolds on the basis of linking information provided by read pairs. In turn, the scaffolds can be aligned relative to the source genome (represented by an encyclopedia set) by the identification of already mapped, sequence-based landmarks in the sequence contigs, thereby associating them with a known location on the genome map.” Adapted from Venter, J. C. et al. The sequence of the human genome.

CHAPTER 3

FIGURE 13 – OpenScaffolder distance estimation. We first compute the sizes of all alignments within $a_i$. We take the start ($s_{i1}$) and end ($e_{i1}$) base pair coordinates of each alignment of $a_i$ in the reference genome (either $G_R$, $G_D$, or $G_C$), and calculate its distance $D(t_{i1}) = e_{i1} - s_{i1}$.

FIGURE 14 – OpenScaffolder pipeline. This image describes the data flow of contigs within the OpenScaffolder algorithm.

FIGURE 15 – Graph representations of gi3 [a] and gi6 [b]. Figure [c] is the composed graph of both [a] and [b].

FIGURE 16 – Graph representations of gi0 [a] and gi13 [b]. Figure [c] is the composed graph of both [a] and [b].

CHAPTER 4

FIGURE 17 – Per base sequence quality. Quality across all bases for files s_6_1.fastq.gz [a] and s_6_2.fastq.gz [b]. From FastQC’s (ANDREWS et al., 2017) documentation: The y-axis shows the quality scores. The higher the score the better the base call. The background divides the y axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red). The quality of calls on most platforms will degrade as the run progresses, so it is common to see base calls falling into the orange area towards the end of a read.

FIGURE 18 – Per base sequence GC content for files s_6_1.fastq.gz and s_6_2.fastq.gz, of the validation Escherichia coli multi-cell assembly. Per sequence GC content measures the GC content of each sequence, across the whole library. The theoretical normal distribution is calculated with the modal GC content.


FIGURE 19 – Sequence length distribution (Distribution of sequence lengths over all sequences): a graph showing the distribution of fragment sizes in both Escherichia coli multi-cell libraries.

FIGURE 20 – Sequence duplication levels (Percent of sequences remaining if duplicated), as reported by FastQC, for the Escherichia coli multi-cell libraries s_6_1.fastq.gz [a] and s_6_2.fastq.gz [b].

FIGURE 21 – Icarus representation of contigs. From top to bottom: the first line represents MeDuSa’s Test1, the second line is MeDuSa’s Test2, and the third line represents the output of OpenScaffolder. Green shades are correct and similar contigs. Red and orange represent relocations. Local misassemblies are represented in grey.

FIGURE 22 – GC(%) content for the Escherichia coli multi-cell dataset: contigs are broken into non-overlapping 100 bp windows. Plot shows number of windows for each GC percentage. MeDuSa’s Test1 is the red line. MeDuSa’s Test2 is the blue line. OpenScaffolder’s GC% is the green line.

FIGURE 23 – Dot plot for the Escherichia coli str. K-12 substr. MG1655 reference assembly against itself. Generated from the nucmer output. The reference sequence is laid across the x-axis. The query sequence is on the y-axis. Wherever the sequences agree, a colored line or dot is plotted. The forward matches are displayed in purple. Reverse matches are displayed in light blue. Because the two sequences are identical, a single red line appears from the bottom left to the top right.

FIGURE 24 – Dot plot for the Escherichia coli multi-cell MeDuSa assembly against the reference genome. Generated from the nucmer output. The reference sequence is laid across the y-axis. The query sequence is on the x-axis. Wherever the sequences agree, a colored line or dot is plotted. The forward matches are displayed in purple. Reverse matches are displayed in light blue. Four small gaps are present. Two overlaps of contigs are displayed as parallel lines. A small bulge is present too.

FIGURE 25 – Dot plot for the Escherichia coli multi-cell OpenScaffolder assembly against the reference genome. Generated from the nucmer output. The reference sequence is laid across the y-axis. The query sequence is on the x-axis. Wherever the sequences agree, a colored line or dot is plotted. The forward matches are displayed in purple. Reverse matches are displayed in light blue. No gaps are present. No overlaps either. No bulges are present.

FIGURE 26 – Cumulative length. From QUAST (GUREVICH et al., 2013) documentation: Cumulative length shows the growth of contig lengths. On the x-axis, contigs are ordered from the largest to smallest. The y-axis gives the size of [...] is represented in green.

CHAPTER 5

FIGURE 27 – Per base sequence content across all bases of the fastq files 671_S3_L001_R1_001.fastq and 671_S3_L001_R2_001.fastq before trimming. Generated from the FastQC output. The x-axis represents the position in the read. The y-axis is the percentage of each DNA base called. There is a bias in both libraries, starting from the first bases, that normalizes around position 18 in the reads.

FIGURE 28 – Per base sequence quality for files [a] 671_S3_L001_R1_001.fastq and [b] 671_S3_L001_R2_001.fastq, of the case study Rhizobium ecuadorense CNPSo 671T. From FastQC’s documentation: The central red line is the median value. The yellow box represents the inter-quartile range (25-75%). The upper and lower whiskers represent the 10% and 90% points. The blue line represents the mean quality. The y-axis on the graph shows the quality scores. The higher the score the better the base call. The background of the graph divides the y axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red).

FIGURE 29 – Per base sequence GC content for files 671_S3_L001_R1_001.fastq and 671_S3_L001_R2_001.fastq, of the case study Rhizobium ecuadorense CNPSo 671T. Per sequence GC content measures the GC content of each sequence, across the whole library. The theoretical normal distribution is calculated with the modal GC content. The theoretical normal distribution is shown in blue. The central peak is the overall GC content.

FIGURE 30 – Per base sequence content across all bases of the fastq files 671_S3_L001_R1_001.fastq [a] and 671_S3_L001_R2_001.fastq [b] after trimming. Generated from the FastQC output. The x-axis represents the position in the read. The y-axis is the percentage of each DNA base called. Both libraries show significant improvement at position 72 of the sequences, which suggests trimming had a positive effect on the bias at the end of them.

FIGURE 31 – Dot plots for the Rhizobium ecuadorense’s largest scaffold vs Rhizobium acidisoli’s chromosome. OpenScaffolder [a], MeDuSa [b]. Generated from the nucmer output. The reference sequence is laid across the y-axis. The query sequence is on the x-axis. Wherever the sequences agree, a colored line or dot is plotted. The forward matches are displayed in purple. Reverse matches are displayed in light blue.

FIGURE 32 – OpenScaffolder dot plots for the Rhizobium ecuadorense second [a], third [b], fourth [c] and fifth [d] scaffolds vs Rhizobium acidisoli’s plasmids. [...] agree, a colored line or dot is plotted. The forward matches are displayed in purple. Reverse matches are displayed in light blue.

FIGURE 33 – GC(%) content: contigs are broken into non-overlapping 100 bp windows. Plot shows number of windows for each GC percentage.

CHAPTER 6

FIGURE 34 – Per base sequence quality for files ecolik12_1.fq [a] and ecolik12_2.fq [b], of the Escherichia coli str K12 substr MG1655 chromosome. From FastQC’s documentation: The central red line is the median value. The yellow box represents the interquartile range (25-75%). The upper and lower whiskers represent the 10% and 90% points. The blue line represents the mean quality. The y-axis on the graph shows the quality scores. The higher the score the better the base call. The background of the graph divides the y axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red).

FIGURE 35 – Per base sequence GC content for files ecolik12_1.fq [a] and ecolik12_2.fq [b], of the Escherichia coli str K12 substr MG1655 chromosome. Per sequence GC content measures the GC content of each sequence, across the whole library. The theoretical normal distribution is calculated with the modal GC content. The theoretical normal distribution is shown in blue. The central peak is the overall GC content.

FIGURE 36 – Sequence length distribution (Distribution of sequence lengths over all sequences): a graph showing the distribution of fragment sizes in both Escherichia coli str K12 substr MG1655 synthetic libraries.

FIGURE 37 – Icarus representation of contigs. From top to bottom: the first line represents MeDuSa’s Test1, the second line is MeDuSa’s Test2, and the third line represents the output of OpenScaffolder. Green shades are correct and similar contigs. Red and orange represent relocations. Local misassemblies are represented in grey.

FIGURE 38 – OpenScaffolder dot plots for the Escherichia coli K12 substr MG1655 synthetic assembly. Reference against itself [a], MeDuSa with one reference genome [b], MeDuSa with multiple reference genomes [c], and OpenScaffolder [d] scaffolds. Generated from the nucmer output. The reference sequence is laid across the y-axis. The query sequence is on the x-axis. Wherever the sequences agree, a colored line or dot is plotted. The forward matches are displayed in purple. Reverse matches are displayed in light blue.


CHAPTER 1

CHAPTER 2

TABLE 1 – Multiple-Reference based scaffolders.

CHAPTER 3

TABLE 2 – Hypothetical alignment between a genome without repeats against itself.

TABLE 3 – Rhizobium acidisoli strain FH23 (Compared by DNA sequence)

TABLE 4 – Rhizobium etli CFN42 (Compared by DNA sequence)

TABLE 5 – Sharing contigs for the Rhizobium case study.

TABLE 6 – Net gain (Ng) of all gi.

CHAPTER 4

TABLE 7 – Reference genomes for the Escherichia coli Multi-cell assembly. Escherichia coli str K12 substr MG1655 is the only complete genome, with one chromosome of 4639675 bp.

TABLE 8 – Quast metrics for MeDuSa Test1 (with only one reference genome), MeDuSa Test2 (with multiple reference genomes) and OpenScaffolder.

CHAPTER 5

TABLE 9 – Reference genomes for the Rhizobium ecuadorense assembly

TABLE 10 – Rhizobium etli CFN42. Reference genome for the Rhizobium ecuadorense assembly

TABLE 11 – Rhizobium acidisoli FH23. Reference genome for the Rhizobium ecuadorense assembly.

TABLE 12 – Summary of both inputs (raw libraries) and outputs (the resulting trimmed files)

TABLE 13 – Quast metrics for MeDuSa Test1 (with only one reference genome), MeDuSa Test2 (with multiple reference genomes) and OpenScaffolder

CHAPTER 6

TABLE 14 – Quast metrics for MeDuSa Test1 (with only one reference genome), MeDuSa Test2 (with multiple reference genomes) and OpenScaffolder


pH: $\mathrm{pH} = -\log[\mathrm{H}^+]$, where log is the base-10 logarithm and [H+] stands for the hydrogen ion concentration in units of moles per liter of solution. The term “pH” comes from the German word “potenz”, which means “power”, combined with H, the element symbol for hydrogen, so pH is an abbreviation for “power of hydrogen”.

PE: According to Illumina: “Paired-end sequencing allows users to sequence both ends of a fragment and generate high-quality, alignable sequence data. Paired-end sequencing facilitates detection of genomic rearrangements and repetitive sequence elements, as well as gene fusions and novel transcripts.”

NGS: Next Generation Sequencing. From NCBI: massively parallel or deep DNA sequencing.

WGS: Whole Genome Sequencing. From Wikipedia: Whole genome sequencing is ostensibly the process of determining the complete DNA sequence of an organism’s genome at a single time. This entails sequencing all of an organism’s chromosomal DNA as well as DNA contained in the mitochondria and, for plants, in the chloroplast.

MDA: Multiple displacement amplification. From Wikipedia: a non-PCR-based DNA amplification technique. This method can rapidly amplify minute amounts of DNA samples to a reasonable quantity for genomic analysis.

DBG: De Bruijn graph. From Wikipedia: “In graph theory, an n-dimensional De Bruijn graph of m symbols is a directed graph representing overlaps between sequences of symbols. It has $m^n$ vertices, consisting of all possible length-n sequences of the given symbols; the same symbol may appear multiple times in a sequence”.

OS: OpenScaffolder, a scaffold assembly algorithm written in the Python language.

C: Minimum cluster variable. Used in OpenScaffolder’s algorithm, defined (after FastQC’s definition) as the minimum length of clustered aligned base pairs required for an alignment to be considered a hit.

GLR: Genome Linear Representation, an OpenScaffolder 2D representation of the positions of the genome’s contigs, used to estimate the position of each contig within the draft scaffold.

pDist: Phylogenetic distance among the reference genomes: the closer a reference genome is to the target, the smaller pDist is.

NAC: Non-Aligned Contigs, the subset of contigs within tn that have not been aligned to the reference genome.

Ng: Net gain. It is an OpenScaffolder variable. It is defined as the quantity of information [...]


1 INTRODUCTION

1.1 DISSERTATION OBJECTIVES

1.2 OVERVIEW OF THE STUDY

2 BACKGROUND

2.1 THE RHIZOSPHERE

2.2 STUDY DESIGN: FROM READS TO A CLOSED GENOME

2.2.1 Data pre-process

2.2.2 Contig Assembly

2.2.3 SPAdes: contig assembly algorithm

2.2.4 Scaffold assembly

2.2.5 Types of scaffold assembly algorithms

3 OPENSCAFFOLDER IMPLEMENTATION

3.1 OPENSCAFFOLDER: SCAFFOLD ASSEMBLY ALGORITHM

3.2 OS: ORIENTATION

3.3 OS: ORDERING

3.4 OS: USAGE, INPUT AND OUTPUT FILES

3.5 DIFFERENCES BETWEEN MEDUSA AND OPENSCAFFOLDER

3.6 OS: PIPELINE

3.7 OS: REPEATS AND GAINED INFORMATION

3.8 OS: JOINING SCAFFOLDS WITH GRAPHS

3.9 RESEARCH PACKAGE I - GENOME ASSEMBLY

4 ESCHERICHIA COLI MULTI-CELL ASSEMBLY

4.1 MATERIALS AND METHODS

4.1.1 Target Genome: library

4.1.2 Reference genomes

4.1.4 Contig assembly

4.1.5 Scaffold assembly

5.1.4 Contig and scaffold assemblies

5.1.5 Quality assessment

5.2 RESULTS

5.3 RESEARCH PACKAGE II - VALIDATION

6 ESCHERICHIA COLI K12 SUBSTR MG1655 ASSEMBLY

6.1.4 Contig and scaffold assemblies

6.1.5 Quality assessment

6.2 RESULTS

7 DISCUSSION AND CONCLUSIONS

7.1 SCAFFOLDING PUTATIVE ELEMENTS

7.2 A WORD ABOUT GENOME ASSEMBLY

7.3 REPEATS IN THE GENOME, SCAFFOLDS, AND CONTIGS

7.4 PHYLOGENY

7.5 CONCLUSIONS

8 APPENDIX

8.1 RESEARCH PACKAGE I: RHIZOBIUM ECUADORENSE CNPSO 671T PER BASE SEQUENCE CONTENT


The structure of this dissertation is based on the amazing thesis of Thais Vasconcelos. Chapters 4, 5, and 6 are self-contained blocks and therefore include some content that may appear repetitive.


1 INTRODUCTION

We all want healthy food. Although meat meets many of our nutritional needs, we need fruits and vegetables to obtain essential vitamins, minerals, and plant chemicals. For example, mutations in the GLO gene coding for L-gulono-γ-lactone oxidase (LACHAPELLE; DROUIN, 2011) (the enzyme that catalyzes the last step of the vitamin C biosynthetic pathway) lead to the inability to biosynthesize ascorbic acid in primates, guinea pigs, teleost fishes, bats, and birds. Although we could get vitamin C from raw meat (cooking degrades most of it), eating raw meat could lead to bacterial infections.

Plants, in turn, need nutrients for their growth, and for years it has been widely accepted that ammonia is an essential fertilizer. Ammonia can be produced by mining niter deposits or by synthetic means (via the Haber-Bosch process (LECTURE, 1932)). The Haber-Bosch process is the industrial procedure that artificially fixes nitrogen (N₂) to produce synthetic ammonia (NH₃). Synthetic ammonia has been called the detonator of the population explosion (VACLAV, 1999), which enabled the global population to increase from 1.6 billion (in 1900) to today’s 7.6 billion. However, producing synthetic ammonia is very expensive, and its extensive use dates from a time when we did not understand its long-term consequences (ALONSO; CAMARGO, 2009; ERISMAN; MONTENY, 1998; WILLIAMS; BONHAM; BERNIER, 2017). Besides its profound ecological impact due to carbon emissions during its production, synthetic ammonia depletes the soil’s nutrients over time. It has also been reported to affect the pH of water and even to become toxic for certain aquatic organisms, thus affecting aquatic ecosystems. Soils treated with synthetic fertilizer do not have the biodiversity required to help plants grow normally. They also become more compact, making it harder for plant roots to grow. Plants that rely on synthetic ammonia tend to be weaker and lack the protection of bacteria and fungi. Furthermore, they are attacked by these same micro-organisms (bacteria and fungi).

One of the solutions for plant diseases is to spray the plants. However, this is nothing more than attacking the symptoms. Healthy plants need healthy soil. Although most of the adverse effects of fertilizers are addressed with pesticides, treated plants tend to become diseased in the long term. From these observations, it becomes clear that neither synthetic ammonia nor pesticides work as we expected. Therefore, some questions arise: how do we make plants healthier? How do we avoid the use of synthetic ammonia? How do we improve the soil’s nutrients to strengthen plant growth? And finally, how do plants grow?

Plant growth depends on symbiosis with two kinds of organisms: fungi and bacteria (TOBERGTE; CURTIS, 2013). Both supply the plant with nutrients from the soil and receive sugars (glucose) in exchange (TOBERGTE; CURTIS, 2013).

One problem of plants is that they only use four to seven percent of the soil volume. However, bacteria associated with plants (called Rhizobacteria) are specialized in extracting K⁺, PO₄, NO₃, and Mg²⁺ from the soil. When Rhizobia, which are free-living bacteria, form a symbiosis with plants, they create a natural defense system around the root, called the Rhizosphere (TOBERGTE; CURTIS, 2013). The Rhizosphere keeps all attacking organisms away from the plant’s roots. Plants need not only Rhizobia but also fungi called Mycorrhiza (TOBERGTE; CURTIS, 2013). Plants form a symbiosis with Mycorrhiza fungi, which increases their absorption capacity. However, the extensive use of synthetic ammonia makes it hard to find Mycorrhiza in the soil. Mycorrhiza hyphae create an absorption and transportation system by growing where the plant’s roots cannot (micropores in the soil), improving the plant’s nutrient absorption. Therefore, it becomes clear that to improve the soil, we need to find the right balance within a symbiotic triangle: plants, bacteria, and fungi (TOBERGTE; CURTIS, 2013).

This research project is part of a broader project (the INCT-MPCPAgro, coordinated by Mariangela Hungria), which seeks to increase plant productivity with inoculants and thereby reduce the use of synthetic ammonia. To tackle this, the idea is to evaluate strategies to optimize symbiotic combinations. Within this problem, the need to understand plant-bacteria symbiosis arises; therefore, Embrapa Soja-Londrina collected and sequenced Rhizobia from the Peru-Ecuador zone (genomes from this pool can help shed light on the coevolution of the symbiosis (RIBEIRO et al., 2015a)).

This project focuses on one problem but tackles it with two approaches in mind, the biological and the computational. Within this project, improving the assembly of that bacterium, called Rhizobium ecuadorense CNPSo 671T, is our biological problem. The technology used for sequencing the bacterium is the MiSeq platform, which generated short paired-end (PE) reads of 75 bp (2 × 150 bp). Due to the presence of repeats in the genome and the non-uniform coverage, the obtained assembly is in contig form.

From a biological point of view, we aim to improve these contigs and obtain larger contiguous sequences (called scaffolds, which generally contain gaps) to try to reconstruct the genome’s chromosome and plasmids. One of our main limitations is the libraries, which are made of short reads. Short reads make it impossible to resolve repeats longer than the read length. Moreover, short-read algorithms are approaching obsolescence due to recent advances in long-read sequencing technologies. However, there are still thousands of short-read libraries in databases¹, so further research might help bring new hybrid algorithms that combine short and long reads to get the most out of both technologies.

We acknowledge that there are numerous scaffolding tools available. However, during the development of this research, we found that the scaffolding tools used by the bioinformatics community tend to ignore repeats within the genome. Besides, their methods for solving the scaffolding problem (e.g., heuristics, use of graphs) might introduce misassemblies during the scaffolding process. Therefore, we aim to improve reference-based scaffolding, and by doing so, we will improve the assembly of our case study, Rhizobium ecuadorense CNPSo 671T. This is our computational aim, which will have an impact on the biological goal.

1.1 DISSERTATION OBJECTIVES

The main objectives of this dissertation are:

1. Improve the assembly of Rhizobium ecuadorense CNPSo 671T.

Rhizobium ecuadorense CNPSo 671T comes from one of the proposed centers of genetic diversification of the common bean (Phaseolus vulgaris L.) (RIBEIRO et al., 2015a). Rhizobia genomes from this region (the Peru-Ecuador genetic pool) could help improve our understanding of the coevolution of the symbiosis (RIBEIRO et al., 2015b).

2. Try to resolve its plasmids and chromosome.

Finishing genomes is one of the milestones of sequencing. However, short reads make this task computationally challenging. Within these limitations, draft sequences have been reported to be of high value: they help estimate the number of genes and their classification, and they support comparative genomics between related organisms (BIASZKOWICZ, 2016).

3. Try to upgrade the assembly scaffolding technology.

During this research, we found some critical limitations in reference-based assembly scaffolders. Therefore, we decided to try to upgrade the reference-based scaffolding assembly algorithm.

¹ According to Kremer et al., from GenBank (2017), from 87,956 prokaryote genomes in GenBank, only 6,586


1.2 OVERVIEW OF THE STUDY

This dissertation consists of five further chapters within three main parts, and one additional chapter that serves as the general appendix. In Part I (Chapters 2 and 3), we describe the background materials and theory needed to understand and further develop this research project (Chapter 2). We also explain in detail the implementation of the OpenScaffolder algorithm (Chapter 3). In Part II (Chapters 4 and 5), we describe the experiments used to validate our algorithm (Chapter 4, Escherichia coli multi-cell assembly) and the assembly of our case study (Chapter 5, Rhizobium ecuadorense CNPSo 671T assembly). Part II is contained within Research Package I, which covers experiments made with real libraries. Finally, in Part III, we discuss the results and present our conclusions (Chapter 7).


2 BACKGROUND

In this chapter, we provide the background materials needed for this research project. We start with a brief explanation of what the rhizosphere is. This helps to understand the symbiosis between plants and microorganisms, thus justifying our primary goal (the improvement of the de novo assembly of Rhizobium ecuadorense CNPSo 671T). Then, we present a brief introduction to the genome assembly process and justify the need for an improved scaffolder algorithm. Finally, we explain the concepts behind our proposed assembly pipeline.

2.1 THE RHIZOSPHERE

Soils are composed of various biological sections called spheres: detritusphere, drilosphere, porosphere, aggregatusphere, and rhizosphere (TOBERGTE; CURTIS, 2013). Furthermore, each sphere’s composition depends on ambient conditions, giving each of them its own properties.

The rhizosphere (Figure 1) is the soil that surrounds the roots and is affected by plant activity. Although the bulk soil and the rhizosphere differ chemically and physically, both the rhizosphere and the rhizoplane are widely colonized by microorganisms. However, the number of organisms per gram of rhizosphere soil is greater than in the bulk soil, mostly because of rhizodeposition: root exudates, sloughed senescent root cells, and mucigel (TOBERGTE; CURTIS, 2013).

The rhizosphere is influenced by the plant’s roots; there, rhizodeposition alters the balance between nitrogen (N) mineralization and immobilization. Two effects of this influence are an increase in the biomass of soil microflora and a decrease in fungal species on the rhizoplane (the root surface) (TOBERGTE; CURTIS, 2013). Therefore, root architecture both influences and is influenced by the physical, chemical, and biological properties of soils. Besides, its development is directly influenced by soil fertility.

Microbial diversity is important for biogeochemical transformations. The rhizosphere is found in aerobic soils, and it is there that microbes grow and release nutrients that benefit plants. Microbes live all around the growing root and have access to organic substrates derived from it, which explains the difference in the number of organisms between the rhizosphere and the bulk soil. In the growing root, the tip secretes exudates that lubricate its passage through the soil, providing carbon for bacteria and fungi, which immobilize nitrogen and phosphorus (TOBERGTE; CURTIS, 2013).

Figure 1: The Rhizosphere. Proposed model of interactions in the rhizosphere and in the bulk soil. Note that in the rhizosphere, products of rhizodeposition stimulate microbial activity. Therefore, they influence the chemical balance of the soil (N mineralization and immobilization) (TOBERGTE; CURTIS, 2013). From Tobergte et al.: “The proportion of total plant production allocated below ground, and the architecture of the root system depend on the distribution and availability of nutrients in soils”.

Source: (TOBERGTE; CURTIS, 2013)

The rhizosphere’s food web is a complex network of root-feeding insects and microbes with symbiotic relationships, such as mycorrhizal fungi, rhizobia, and Frankia.

Nitrogen (N) is of fundamental importance for the productivity of most ecosystems, and soil microorganisms have a direct influence on the N reservoirs in the soil.

Most organisms (both prokaryotic and eukaryotic) assimilate N into organic form and then release inorganic N; that is the N cycle. However, N2 fixation, nitrification, and denitrification by bacteria are key to understanding their influence on the availability and form of N in ecosystems.


2.2 STUDY DESIGN: FROM READS TO A CLOSED GENOME

Nowadays, sequencing technologies have evolved to the point of delivering reads of 2.3 million base pairs (PAYNE et al., 2019). The Garvan Institute in Sydney reported a million base pair read in 2017. Nonetheless, there are thousands of projects whose sequencing technologies allowed only small reads (from about 50 bp to 300 bp). Assembling a whole genome from those small reads is a computationally challenging problem.

We define this problem, based on Myers’ definition (Myers Jr, 2016), from a general point of view as follows. A genome is the complete set of genetic material of an organism’s cell. It contains M elements (from one to n chromosomes and from zero to p plasmids, since some organisms do not have plasmids¹), and M can vary from genome to genome (M can be any number from one to m) (Myers Jr, 2016). However, with current technologies, we cannot read those M elements directly. What we can do is prepare libraries by splitting those M elements into smaller sequences, which we call reads. Therefore, to reconstruct the genome, we need to assemble M shortest superstrings (COMPEAU; PEVZNER; TESLER, 2011), which are larger strings that contain as many of the reads as possible as substrings.

Computationally, we define our problem, based on Myers’ definition (Myers Jr, 2016), as follows: given a set of reads (computationally known as strings), reconstruct the genome from which they came.

• Input: A set of strings (reads)

• Output: A genome containing M elements

The ideal scenario would be to output a closed genome. However, a significant percentage of a genome’s length is reported to consist of repeated sequences, and these repeats cannot be resolved unless the reads are long enough (CHITSAZ et al., 2011). In other words, we cannot reconstruct a repeat of length V unless we have reads longer than V.

Within these observations, we can reformulate the assembly problem. However, before we do, there is still one more thing to add. The genome assembly problem can be divided into two stages: contig assembly and scaffold assembly. We can think of both stages as instances of the previously defined problem. So we will define them as follows:

1. Contig assembly. Given a set of reads, construct all of its shortest superstrings.

• Input: a set of reads

• Output: a set of all the shortest superstrings, called contigs

¹ From Wikipedia: “(...)A plasmid is a small DNA molecule within a cell that is physically separated from chromosomal DNA and can replicate independently. They are most commonly found as small circular, double-stranded DNA molecules in bacteria; however, plasmids are sometimes present in archaea and eukaryotic organisms(...)”

2. Scaffold assembly. Given a set of contigs, reconstruct the genome, composed of M elements, where each reconstructed element is called a scaffold.

• Input: a set of contigs

• Output: a set of M scaffolds

Both stages take sets of strings as input (reads for contig assembly, contigs for scaffold assembly) and output a set of longer superstrings (contigs and scaffolds, respectively). The set of scaffolds may or may not be the complete set of M elements of the target genome (the number of scaffolds in the output will be greater than or equal to the number of elements in the genome). With short reads as input, the output will probably be a draft set of scaffolds. This is as far as we can go with short reads alone.
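To make the superstring formulation concrete, the following minimal sketch greedily merges the two strings with the longest suffix-prefix overlap until no overlaps remain. It is only a toy illustration that assumes error-free reads; it is not the method used by SPAdes or by any scaffolder discussed in this work, and the function names are hypothetical.

```python
from __future__ import annotations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for size in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_superstring(reads: list[str]) -> list[str]:
    """Repeatedly merge the pair of strings with the largest overlap.

    Strings that share no overlap stay separate, so the result is a
    list of contig-like superstrings rather than a single string.
    """
    # Drop duplicates and reads fully contained in another read.
    contigs = [r for r in set(reads)
               if not any(r != other and r in other for other in reads)]
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return sorted(contigs)
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)]
        contigs.append(merged)
```

For instance, the three reads ACGTAC, TACGGA, and GGATTT collapse into a single superstring, while reads without any overlap are returned unchanged. Greedy merging is a heuristic and does not guarantee the shortest superstring in general.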

Figure 2 shows this process more in detail. Our genome assembly pipeline is divided into four stages: Data pre-process, contig assembly, scaffold assembly, and post-processing.

In the data pre-process stage, we control the quality of the input libraries. Depending on the quality control reports (from FastQC (ANDREWS et al., 2017)), we filter adapters, overrepresented sequences, and low-quality base calls from the reads (with Trim Galore (KRUEGER, 2015)). If the resulting reads pass the quality control, we proceed to the next stage.

In the contig assembly stage, we assemble reads into larger contiguous sequences (contigs) using the SPAdes assembler (BANKEVICH et al., 2012). The resulting contigs are passed on to the next stage.

In the scaffold assembly stage, we scaffold contigs into larger, non-contiguous gapped sequences (scaffolds). We used two algorithms for this stage: MeDuSa (BOSI et al., 2015) and OpenScaffolder. The resulting scaffolds are then passed to the next stage.

In the post-processing or final stage of our pipeline, we take the scaffolds and perform a quality check with the QUAST software (GUREVICH et al., 2013). If the reported scaffolds meet the expected quality, the process is considered finished. However, if the scaffolds fail the QUAST quality check, we re-run whichever stage is necessary (data pre-process, contig assembly, or scaffold assembly).
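The stages above can be sketched as a list of command lines. The snippet below only builds the commands and does not run them. The file names, the output layout, and the function name are assumptions made for illustration, and only basic, commonly documented options of each tool are used. The scaffolding stage is left out because the text does not describe how the scaffolders are invoked.

```python
from pathlib import Path


def build_pipeline_commands(raw_r1, raw_r2, trimmed_r1, trimmed_r2,
                            scaffolds, outdir):
    """Return one command (as an argument list) per pipeline step.

    All paths are hypothetical. The trimmed read files are passed in
    explicitly instead of being derived from Trim Galore's naming scheme.
    """
    out = Path(outdir)
    return [
        # Stage 1a: quality reports for the raw paired-end libraries
        # (the output directory is expected to exist beforehand).
        ["fastqc", raw_r1, raw_r2, "-o", str(out / "fastqc")],
        # Stage 1b: adapter and quality trimming of the read pairs.
        ["trim_galore", "--paired", raw_r1, raw_r2,
         "-o", str(out / "trimmed")],
        # Stage 2: contig assembly from the trimmed read pairs.
        ["spades.py", "-1", trimmed_r1, "-2", trimmed_r2,
         "-o", str(out / "spades")],
        # Stage 3 (scaffolding) is omitted in this sketch.
        # Stage 4: quality report for the final scaffolds.
        ["quast.py", scaffolds, "-o", str(out / "quast")],
    ]
```

Each returned list could be passed to `subprocess.run(command, check=True)`, re-running a stage whenever its quality check is not satisfactory.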


Figure 2: Genome assembly pipeline. Divided in four stages: data pre-process, contig assembly, scaffold assembly, and post-processing.

Source: (ANDREWS et al., 2017; KRUEGER, 2015; BANKEVICH et al., 2012; BOSI et al., 2015; GUREVICH et al., 2013)

Selection of pipeline’s tools

For the selection of our tools, we do not intend to reinvent the wheel. We acknowledge that there are state-of-the-art tools (WAJID; SERPEDIN, 2012) that are widely accepted and used by the community (e.g., SPAdes (BANKEVICH et al., 2012), Velvet (ZERBINO; BIRNEY, 2008)). For scaffolding, however, we only considered tools that accept multiple reference genomes. Table 1 shows a summary of multiple-reference-based scaffolders.

Table 1: Multiple-reference-based scaffolders.

Scaffolder      Closed  Draft  Contig  Align. method  Constraints
MeDuSa          X       X      -       Mummer3        Ref. has to be complete or draft
Multi-CAR       X       -      -       Mummer         Weights for ref.; server only
Chromosomer     X       X      -       BLAST          Start & end pos.; score values
OpenScaffolder  X       X      X       Mummer4        Input ref. split

Source: (BOSI et al., 2015), (CHEN et al., 2016a), (TAMAZIAN et al., 2016), OpenScaffolder.

MeDuSa (BOSI et al., 2015) accepts closed and draft reference genomes; however, it is prone to noise when contig-stage genomes are introduced. Multi-CAR (CHEN et al., 2016a) only accepts closed reference genomes, and its principal constraints are the need for weights for the reference genomes and the fact that it runs only on a server. Chromosomer (TAMAZIAN et al., 2016) accepts both closed and draft reference genomes. Although this tool is the only one with an algorithm similar to ours, its major drawback is that its alignments require associated score values (such as those of the BLAST alignment tool). Besides, it requires the user to provide the start and end positions of the aligned regions of the fragments and the reference chromosomes. On the other hand, OpenScaffolder places no constraints on the stage of the input reference genomes: it accepts closed, draft, or contig-stage reference genomes as input. However, it is important to mention that the quality of its output depends completely on the quality (and phylogenetic distance) of the input genomes. Furthermore, its major drawback is that the input genomes need to be separated before being fed to the algorithm (i.e., if we have a reference with one chromosome and three plasmids, it has to be separated into four fasta files, one for each element of the genome).
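The reference-splitting requirement can be met with a few lines of code. The following sketch writes each record of a multi-fasta file to its own fasta file. The function name, the output naming scheme, and the assumption that every header starts with a unique identifier are ours, not part of OpenScaffolder.

```python
from pathlib import Path


def split_fasta(fasta_path, outdir):
    """Write each record of a multi-fasta file to its own fasta file.

    Files are named after the first word of each header line, with
    characters that are unsafe in file names replaced by underscores.
    Returns the list of written paths, in input order.
    """
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    with open(fasta_path) as source:
        for line in source:
            if line.startswith(">"):
                if handle is not None:
                    handle.close()
                words = line[1:].split()
                name = words[0] if words else f"record_{len(written) + 1}"
                safe = "".join(c if c.isalnum() or c in "._-" else "_"
                               for c in name)
                path = out / f"{safe}.fasta"
                handle = open(path, "w")
                written.append(path)
            if handle is not None:
                handle.write(line)
    if handle is not None:
        handle.close()
    return written
```

A reference with one chromosome and three plasmids would therefore produce four files, one per element.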

Why another assembly scaffolder?

As we progressed in our understanding of scaffolding, some facts became clear to us. First, the quality of the contigs retrieved from the contig assembly stage was as good as we could get, so we kept this stage fixed (for more details, please refer to section 2.2.2). In the second stage, the set of scaffolds retrieved from the MeDuSa algorithm always had a high genome fraction and large scaffolds. However, the number of misassemblies was far from ideal. One more detail that bothered us was the number of scaffolds MeDuSa returned as the set M. For the Rhizobium ecuadorense case study, from 1956 contigs, it reported 1606 scaffolds, of which only sixty were longer than a thousand base pairs. Most readers with a computational background would agree that this is no problem, and that a couple of lines of code could clean this scaffold fasta file². However, not everyone has this kind of knowledge; a biologist, we presume, might want a fasta file that is as ready to work with as possible.
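As an illustration of the kind of cleanup mentioned above, the following sketch separates scaffolds by length. The 1000 bp threshold follows the figure quoted in the text; the function names and the choice to keep the short scaffolds in a second file (so that no data is lost) are assumptions made for this example.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a fasta file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_scaffolds(in_path, kept_path, short_path, min_length=1000):
    """Split scaffolds into two fasta files by length.

    Scaffolds of at least `min_length` bases go to `kept_path`; shorter
    ones go to `short_path`. Returns (kept, short) counts.
    """
    kept = short = 0
    with open(kept_path, "w") as long_out, open(short_path, "w") as short_out:
        for header, seq in read_fasta(in_path):
            if len(seq) >= min_length:
                long_out.write(f">{header}\n{seq}\n")
                kept += 1
            else:
                short_out.write(f">{header}\n{seq}\n")
                short += 1
    return kept, short
```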

Further analysis reveals that some of those scaffolds are even smaller than 100 or 200 base pairs. We acknowledge that those scaffolds might contain valuable information, but we do not believe they should be delivered within the final fasta file, as they may also contain garbage from the sequencing process. Besides, one of the restrictions of MeDuSa’s algorithm is that the input should only be closed or draft genomes. Although this is an improvement over scaffolders that accept only one genome and those that require it to be closed, most of the datasets we found as references for the Rhizobium were in contig form.

² In any given programming language (in which you can import the sequence files as strings) you can filter out

Therefore, we needed a scaffold algorithm that could handle all kinds of inputs, took special care of misassemblies, and returned scaffolds, both ready to be used and as close as possible to the set M of the target genome. All this, while also returning those small contigs that might contain important information, to ensure we miss no data.

2.2.1 Data pre-process

Almost any project that handles raw data needs to ensure the quality of its data. Although we can work with raw data, the resulting quality of such a project might be less than desirable.

Data preprocessing is a crucial step in data analysis for any application that deals with raw data (OJEDA et al., 2014). Genome sequencing can generate millions of reads per run, and drawing biological conclusions from raw data without preprocessing it can carry biases in the reads into the conclusions and reduce their usefulness. Moreover, problems can originate as early as the in vitro stage, which can introduce elongation and sequencing errors (VRANCKEN et al., 2016). The genome assembly process is no exception, as reads might carry many problems (EL-METWALLY et al., 2013), such as biases, overrepresented sequences, low-quality base calls, substitutions, indels, and N’s. Therefore, neglecting these errors can lead not only to assembly errors but also to undetectable problems within the final assemblies.

For the sake of clarity, we divided the data pre-process section into two stages. First, we assess the quality of the data to check for problems, because if there are any, we need to take measures to remove them. We call this sub-section Quality assessment. Then, we make sure no rotten eggs go into the recipe. We call this sub-section Data filtering.

Data pre-process: Quality assessment

High-throughput sequencing generates millions of reads, and the volume of data is growing exponentially. With the quality assessment of this data in mind, various tools have been developed. One of the most widely used is FastQC (ANDREWS et al., 2017), developed by Simon Andrews at the Babraham Institute in Cambridge. We chose it not only because the project is stable and its code mature, but also because some of the newer quality control tools are, to some extent, based on it. It has even been ported to the R language. Besides being easy to use, FastQC generates reports that help identify problems in our fastq files that originate either in the sequencer or in the library material.

FastQC

FastQC is developed in the Java language. Its most recent version is 0.11.8; however, we used versions 0.11.4 and 0.11.5. We did not upgrade to the newer versions because they only add compatibility with newer sequencing data (such as Oxford Nanopore reads) and fix minor bugs that do not affect our results. Also, starting from version 0.11.6, k-mer plots were disabled, and we think any information that can be gathered might be useful at some point or for further studies (someone might find something we missed).

Overrepresented sequences

In a given dataset, no particular sequence should be overrepresented; if one is, there can be several explanations. Typical reasons for overrepresented sequences are contamination of the library, biological importance, or lack of diversity within the dataset. Some details about the module’s inner workings (from FastQC’s documentation) should be clarified. To avoid excessive RAM usage, FastQC only scans the first hundred thousand sequences. Only sequences longer than 20 bp are considered for this analysis, and sequences longer than 75 bp are trimmed to 50 bp. No more than one mismatch is allowed in a hit. A sequence that makes up more than 0.1% of the total is considered overrepresented. Overrepresented sequences are matched against a database of common contaminants, in case a particular sequence is already known.
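The counting rule described above can be sketched as follows. This is a simplified illustration of the description in the text (first 100,000 reads, truncation of reads longer than 75 bp to 50 bp, 0.1% threshold) and not FastQC’s actual implementation; the function name and parameters are hypothetical, and the contaminant lookup is left out.

```python
from collections import Counter
from itertools import islice


def overrepresented(reads, sample_size=100_000, threshold=0.001):
    """Return {sequence: fraction} for sequences above `threshold`.

    Only the first `sample_size` reads are inspected, and reads longer
    than 75 bp are truncated to 50 bp before counting.
    """
    sample = list(islice(reads, sample_size))
    total = len(sample)
    if total == 0:
        return {}
    counts = Counter(seq[:50] if len(seq) > 75 else seq for seq in sample)
    return {seq: n / total for seq, n in counts.items()
            if n / total > threshold}
```

Because only a sample of the library is inspected, a sequence that becomes frequent later in the file can go unnoticed, which is the limitation discussed in the next paragraph.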

From these observations, it is clear that the assessment of overrepresented sequences helps identify most problems with reads. However, relying on this tool alone might not be suitable for all purposes, since it does not cover the whole dataset. An N in Illumina reads means that the software was unable to make a base call, and the presence of some runs of N’s is expected. When working with Illumina sequences, the first and last few thousand reads are expected to be of low quality because they originate from the edges of the flowcell, which makes base calling difficult. A common practice to avoid low-quality reads is to remove the first and last 100000 reads from the library. Instead of doing this, we decided to use a trimming tool called Trim Galore, which has an option for trimming N’s from reads. For more information on this subject, please refer to subsection 2.2.1.

Per base sequence content

This statistic assesses the percentage of each called DNA base (it only takes A, T, C, and G into consideration) at each position in the read sequence. Within a typical library, there should be little difference between the bases, so the plot lines look parallel to each other. Nonetheless, libraries produced by priming with random hexamers (e.g., RNA-Seq) and libraries fragmented with transposases are expected to have a bias. Some points to consider about its inner workings: if the difference between A and T, or between G and C, is more than 10% at any position, the module triggers a warning; if the difference is more than 20%, it fails.
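A minimal sketch of this check is shown below. It assumes that the comparison is made between A and T and between G and C at each position, as FastQC’s documentation describes, with the 10% and 20% thresholds quoted above. The function names are hypothetical and the code is not FastQC’s implementation.

```python
def per_base_content(reads):
    """Return, for each read position, the percentage of A, C, G and T.

    Bases other than A, C, G and T (such as N) are ignored.
    """
    length = max((len(r) for r in reads), default=0)
    table = []
    for pos in range(length):
        column = [r[pos] for r in reads
                  if pos < len(r) and r[pos] in "ACGT"]
        total = len(column)
        table.append({base: (100.0 * column.count(base) / total
                             if total else 0.0)
                      for base in "ACGT"})
    return table


def content_status(table, warn=10.0, fail=20.0):
    """Classify a per-base content table as 'pass', 'warn' or 'fail'."""
    worst = max((max(abs(row["A"] - row["T"]), abs(row["G"] - row["C"]))
                 for row in table), default=0.0)
    if worst > fail:
        return "fail"
    if worst > warn:
        return "warn"
    return "pass"
```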

According to FastQC’s documentation, the most common reasons for biases are: overrepresented sequences, which might bias the composition of bases; biased fragmentation (produced by random hexamers or transposases); biased composition libraries (some libraries have a biased sequence composition, e.g., libraries treated with sodium bisulfite, which converts most C’s into T’s); and excessive adapter trimming (which removes sequences that match short stretches of adapters).

As mentioned above, there are four common causes of this kind of bias. The most probable one is overrepresented sequences, as reported in the “Overrepresented sequences” paragraph above. Nonetheless, it is worth noting that the other three reasons (biased fragmentation, biased composition of libraries, and excessive adapter trimming) are also possible, yet they are beyond the scope of this project (since they depend on laboratory library production).

Per base sequence quality

This module generates a box plot of the distribution of quality values across all reads at each position in the library. The elements of each box plot are as follows: the central red line is the median; the yellow box is the interquartile range (25-75%), bounded by the first and third quartiles; and the lower and upper whiskers represent the 10% and 90% points, respectively.

Additionally, the blue line represents the mean quality at each position. Although box plots are self-explanatory, it is worth noting the thresholds: this module triggers a warning if the lower quartile at any position falls below a quality score of 10, or if the median falls below 25. If the lower quartile is less than 5, or the median is less than 20, it reports a failure. The most common reasons for low quality (from FastQC’s documentation) are: degradation of quality during long runs (due to the degradation of the sequencing chemistry), loss of quality earlier in the run (which sometimes happens because of bubbles passing through the flowcell), and low coverage at a given base (in libraries with reads of varying length).
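The warning and failure rules can be sketched as follows, under two assumptions that are ours: the thresholds are Phred quality scores applied to the lower quartile and the median at each position, as FastQC’s documentation describes, and the quality strings use the Phred+33 encoding. The function name is hypothetical.

```python
import statistics


def quality_status(quality_lines, offset=33):
    """Classify fastq quality strings as 'pass', 'warn' or 'fail'.

    For every read position, the lower quartile and the median of the
    Phred scores are compared against the thresholds:
    fail if lower quartile < 5 or median < 20,
    warn if lower quartile < 10 or median < 25.
    """
    length = max((len(q) for q in quality_lines), default=0)
    status = "pass"
    for pos in range(length):
        column = [ord(q[pos]) - offset
                  for q in quality_lines if pos < len(q)]
        median = statistics.median(column)
        if len(column) > 1:
            lower = statistics.quantiles(column, n=4,
                                         method="inclusive")[0]
        else:
            lower = column[0]
        if lower < 5 or median < 20:
            return "fail"
        if lower < 10 or median < 25:
            status = "warn"
    return status
```

In this encoding, the character `I` stands for a score of 40, `7` for 22, and `#` for 2, so a library made only of `I` characters passes and one made only of `#` characters fails.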

Per sequence GC content

FastQC (ANDREWS et al., 2017) calculates the GC distribution over all sequences by measuring the GC content of each sequence across the whole library. It then compares it to a theoretical normal distribution, calculated from the modal GC content. A couple of notes from FastQC’s documentation: the theoretical normal distribution is shown in blue, and its central peak corresponds to the overall GC content.
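The comparison described above can be approximated with the sketch below. It computes the GC percentage of each read, builds the observed distribution, and derives a normal curve centered on the modal GC content. Estimating the standard deviation around the mode is our assumption, as are the function names; FastQC’s exact procedure may differ.

```python
import math
from collections import Counter


def gc_percent(seq):
    """GC content of a read as a whole percentage (N's are ignored)."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0
    return round(100 * (seq.count("G") + seq.count("C")) / called)


def gc_distribution(reads):
    """Count how many reads fall on each GC percentage."""
    return Counter(gc_percent(read) for read in reads)


def theoretical_curve(distribution):
    """Normal curve centered on the modal GC content.

    Returns the expected number of reads for each GC percentage
    from 0 to 100, scaled to the total number of reads.
    """
    if not distribution:
        return {}
    total = sum(distribution.values())
    mode = max(distribution, key=distribution.get)
    variance = sum(n * (gc - mode) ** 2
                   for gc, n in distribution.items()) / total
    sd = math.sqrt(variance) or 1.0  # avoid a zero-width curve
    scale = total / (sd * math.sqrt(2 * math.pi))
    return {gc: scale * math.exp(-((gc - mode) ** 2) / (2 * sd ** 2))
            for gc in range(101)}
```

Plotting the observed counts next to the values from `theoretical_curve` would make shifts and sharp or broad peaks easy to spot.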

Within a standard library, the GC distribution should be close to the theoretical normal distribution. It is worth emphasizing some of the implications of plots that differ too much from it: divergences could indicate contaminated or biased libraries; a shifted distribution could indicate a systematic bias that is independent of base position; sharp peaks on an otherwise smooth distribution could indicate a specific contaminant; and broader peaks may be due to the presence of a different species in the library.

Therefore, any anomaly in this module can be interpreted as a problem in the library.

Data pre-process: Quality assessment conclusions

As reported above, anomalies in any of the FastQC modules might indicate that a library has problems. The anomalies identified were: overrepresented sequences of N’s in both the forward and reverse libraries, a bias in the sequence content of the first 18 bases in both libraries, and a bias at the central peak of the per sequence GC content.

Data pre-process: Data filtering

For most software tools that deal with data, filtering has been reported to be of great help in removing errors and noise. To filter low-quality reads and trim poor-quality bases, one of the most widely used tools is Trim Galore (KRUEGER, 2015).

Before Trim Galore, there is another program we need to talk about: Cutadapt (MARTIN, 2011). Cutadapt is a program written by Marcel Martin, at the Mercator Research Center Ruhr, Germany. Cutadapt is an outstanding piece of software, written mainly in the Python language (ROSSUM, 1995), with some C extensions to speed it up. It finds and removes adapter sequences, primers, and poly-A tails. We chose Trim Galore because it is a wrapper around Cutadapt and FastQC.

In computer science, a wrapper is simply a piece of software (a program or tool) that contains another program within it, although this definition might be oversimplified, computationally speaking. That means that Trim Galore is a program that builds its functionality around other programs (Cutadapt and FastQC). In other words, Trim Galore makes it easier for the user to use both FastQC and Cutadapt.


Trim Galore

Written by Felix Krueger in the Perl programming language (WALL et al., 1994), Trim Galore is a wrapper around Cutadapt and FastQC. It uses both Cutadapt and FastQC for adapter removal, trimming, and quality control. Besides, it adds some extra functionality (from Trim Galore’s documentation), which is organized in four steps: quality trimming (trims low-quality base calls from the 3’ end of the reads), adapter trimming (removes adapter sequences from the 3’ end of reads), removal of short sequences (if the sequences resulting from steps one and two are shorter than 20bp, they are removed), and specialized trimming (such as the ’hardtrim’ options, which hard-clip sequences to any desired size from either end, i.e., 3’ or 5’).
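The first three steps can be illustrated with the simplified sketch below. It is not Trim Galore’s or Cutadapt’s code: the quality trimming follows the running-sum idea documented for Cutadapt, the adapter search only accepts exact matches, and the default adapter (the start of the Illumina universal adapter), the cutoff values, and the function names are assumptions made for this example.

```python
def quality_trim_3prime(seq, quals, cutoff=20):
    """Trim low-quality bases from the 3' end.

    Walking from the 3' end, keep a running sum of (quality - cutoff)
    and cut where that sum reaches its minimum; stop once it is positive.
    """
    running, lowest, cut = 0, 0, len(seq)
    for i in range(len(seq) - 1, -1, -1):
        running += quals[i] - cutoff
        if running > 0:
            break
        if running < lowest:
            lowest, cut = running, i
    return seq[:cut], quals[:cut]


def adapter_trim_3prime(seq, quals, adapter="AGATCGGAAGAGC", min_overlap=1):
    """Remove an exact adapter match (full, or partial at the read end)."""
    for start in range(len(seq)):
        tail = seq[start:start + len(adapter)]
        if len(tail) >= min_overlap and adapter.startswith(tail):
            return seq[:start], quals[:start]
    return seq, quals


def trim_read(seq, quals, cutoff=20, min_length=20):
    """Apply quality trimming, adapter trimming and the length filter.

    Returns the trimmed (sequence, qualities) pair, or None when the
    read ends up shorter than `min_length`.
    """
    seq, quals = quality_trim_3prime(seq, quals, cutoff)
    seq, quals = adapter_trim_3prime(seq, quals)
    return (seq, quals) if len(seq) >= min_length else None
```

With a minimum overlap of one base, even a single trailing base that matches the start of the adapter is removed, which is a deliberately aggressive choice in this sketch.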

2.2.2 Contig Assembly

With the advent of NGS, the volume of sequencing data became huge. However, the size of the reads decreased from 500-600 bp to an average of 100-400 bp. Eugene Myers (Myers Jr, 2016) defines contig assembly as the tiling of reads into larger contiguous sequences (i.e., contigs, Figure 3). Furthermore, if the coverage is good enough, errors in reads can be corrected (WAJID; SERPEDIN, 2012). We all know how difficult a jigsaw puzzle with hundreds of pieces can be, even with the reference image as a guide. Now, imagine having to solve it without previous knowledge (this is called de novo assembly), that is, without a reference genome (the image), and with thousands or even millions of pieces. That is genome assembly.

Figure 3: Contig assembly. From Myers: “The assembly problem is to reconstruct as much of a genome as possible given a collection of reads or read pairs”.

Source: (Myers Jr, 2016)

Because NGS reads are much shorter than even the smallest genomes (MILLER; KOREN; SUTTON, 2010), WGS over-samples the target genome so that as much as possible of the original genome can be reconstructed from the reads.

One of the problems of contig assembly is repeat resolution. When multiple regions of the genome share repeats, they become indistinguishable (MILLER; KOREN; SUTTON, 2010). Repeat resolution requires reads longer than the repeats; when this is not possible, repeats can be resolved with paired-end sequencing techniques. Besides size constraints, repeats are further complicated by the presence of errors, which can lead to false-positive joins (MILLER; KOREN; SUTTON, 2010).

There are numerous contig assemblers to date, and they can be divided into five schemes (WAJID; SERPEDIN, 2012): Overlap-Layout-Consensus, Alignment-layout-consensus, greedy approach, graph-based, and Eulerian.

Figure 4: Contig assembly software algorithms. Categorized by their scheme (Overlap-Layout-Consensus, Alignment-layout-consensus, greedy approach, graph-based, and Eulerian).


From Figure 4, we mention the following assemblers, based on their algorithm scheme (WAJID; SERPEDIN, 2012). Overlap-layout-consensus algorithms are Genovo (LASERSON; JOJIC; KOLLER, 2011), SHARCGS (DOHM et al., 2007), and Maximum Likelihood Genome Assembly (MEDVEDEV; BRUDNO, 2009). Graph-based algorithms are Velvet (ZERBINO; BIRNEY, 2008), ABySS (SIMPSON et al., 2009), Bidirected String Graphs for Genome Assembly (MYERS, 2005), Edena (HERNANDEZ et al., 2008), Minimus (SOMMER et al., 2007), AllPaths (BUTLER et al., 2008), and Taipan (SCHMIDT et al., 2009). Eulerian path algorithms are EULER (MULYUKOV; PEVZNER, 2001), de novo assembly with A-Bruijn graphs, and EULER-SR (CHAISSON; PEVZNER, 2008). Comparative assemblers are AMOS-Cmp (POP et al., 2004), gene-boosted assembly (SALZBERG et al., 2008), and assisted assembly (GNERRE et al., 2009). The exhaustive approach has one assembler, the Exhaustive Genome Assembly (SHAH et al., 2004). Finally, the greedy approach includes SSAKE (WARREN et al., 2006), VCAKE (JECK et al., 2007), and QSRA (BRYANT; WONG; MOCKLER, 2009).

Among the many contig assemblers, one of the most used by the bioinformatics community is the Velvet assembler. However, E+V-SC (based on the combination of Velvet and Euler) demonstrated an improvement over Velvet and Euler-SR. Although E+V-SC, SOAPdenovo (LUO et al., 2012), and Velvet are widely used by the community, SPAdes has more recently reported great advances compared to them (BANKEVICH et al., 2012).

2.2.3 SPAdes: contig assembly algorithm

In section 2.2, we defined the contig assembly problem as the problem of reconstructing contiguous sequences (superstrings) from smaller reads. This is an oversimplification of what contig assembly is.

SPAdes (BANKEVICH et al., 2012) was initially conceived as a contig assembly tool for single-cell de novo assembly. Nonetheless, its generalization to standard bacterial datasets was introduced later on. A widely used technique for genome sequencing of single-cell is MDA. Multiple displacement amplification (MDA) is a non-PCR technique. It amplifies small samples of DNA to allow DNA sequencing. For this purpose, MDA anneals random hexamer primers to the template. However, MDA generates highly non-uniform coverage within the library. Therefore, SPAdes defines the assembly problem as follows: given a set of reads, reconstruct the original sequence (The set M of elements within the genome).

• Input: a set of reads (of length L between 35 and 400 bp) that could have been generated with MDA and that could be paired-end.

The following constraints apply: coverage thresholds from traditional contig assemblers cannot be used; the insert size distribution is not normal and might be bimodal; and there are more errors than in conventional libraries (e.g., chimeric connections).

Contig assembly is not a trivial task; its algorithms are complex and can follow multiple paths to converge on a contig solution. Here, however, we focus on the inner workings of the SPAdes algorithm alone. As seen in Figure 5, the SPAdes pipeline has four main steps: error correction, de Bruijn graph processing, repeat resolution, and postprocessing.

Figure 5: SPAdes pipeline.

Source: SPAdes (BANKEVICH et al., 2012)

We assume a certain level of knowledge from readers with a computational background: we expect them to be familiar with graphs, clusters, and subclusters, and to know what a k-mer is. The following subsections might be confusing for readers without a computational background; if they prove hard to digest, we advise skipping to chapter 4.

SPAdes: error correction

In order to understand how SPAdes performs error correction, we must first understand what BayesHammer (NIKOLENKO; KOROBEYNIKOV; ALEKSEYEV, 2013) is. An MDA library (e.g., a single-cell library) might have non-uniform coverage across the genome, so its reads might have highly variable coverage, which makes error correction difficult in reads with low coverage. BayesHammer is software that improves on the functionality of Hammer. Hammer was designed for the error correction of reads from single-cell datasets. To correct reads, Hammer splits reads into N k-mers (Figure 6) and then uses those k-mers to construct a Hamming graph. The idea is that a cluster of similar k-mers concentrates coverage on its central k-mer, which helps to resolve ambiguous bases. The problem is that some clusters might have more than one center. To solve this, BayesHammer uses probabilistic subclustering to find multiple centers in a cluster.

Figure 6: K-mers. K-mers are substrings of length k. In this figure, the genome S is divided into one 4-mer (ATTC) and all of its possible 3-mers.

Source: (LANGMEAD, 2017)

SPAdes splits reads into k-mers and then uses BayesHammer to correct each k-mer (Figure 7). This process reduces the number of k-mers; depending on the library, it can reduce the number of k-mers to 20% of the original. With the number of k-mers significantly reduced, it is now possible to create a de Bruijn graph.
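The clustering idea described above can be illustrated with a minimal sketch. It is not the BayesHammer implementation: we assume that two k-mers are joined in the Hamming graph when they differ in at most `max_dist` positions, and that the center of a cluster is simply its most frequent k-mer, whereas BayesHammer uses probabilistic subclustering. The function names, the parameter `max_dist`, and the toy reads are hypothetical.

```python
from collections import Counter
from itertools import combinations


def kmers(read, k):
    """Return every substring of length k of the read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_kmers(reads, k, max_dist=1):
    """Map each k-mer to the center of its Hamming-graph cluster.

    Assumption: the center is the most frequent k-mer of the cluster
    (ties broken alphabetically), not a probabilistic consensus.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    parent = {km: km for km in counts}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Join k-mers that are close in Hamming distance (quadratic, toy scale only).
    for a, b in combinations(counts, 2):
        if hamming(a, b) <= max_dist:
            parent[find(a)] = find(b)

    clusters = {}
    for km in counts:
        clusters.setdefault(find(km), []).append(km)

    center_of = {}
    for members in clusters.values():
        center = min(members, key=lambda m: (-counts[m], m))
        for m in members:
            center_of[m] = center
    return center_of


# Three error-free reads and one read with a miscalled base (T -> A).
reads = ["ATTCG", "ATTCG", "ATTCG", "ATACG"]
center_of = cluster_kmers(reads, k=4)
print(center_of["ATAC"])  # ATTC: the rare k-mer is mapped to its cluster center
```

In this toy example the erroneous 4-mers ATAC and TACG each appear once and fall into the clusters of ATTC and TTCG, which appear three times, so correcting towards the centers changes a single nucleotide.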

SPAdes: De Bruijn graph processing

Figure 7: From BayesHammer: Reads correction. Grey k-mers indicate non-solid k-mers. Red k-mers are the centers of the corresponding clusters (two grey k-mers striked through on the right are non-solid singletons). As a result, one nucleotide is changed.

Source: (NIKOLENKO; KOROBEYNIKOV; ALEKSEYEV, 2013)

The actual contig assembly in SPAdes starts with the creation of the de Bruijn graph (DBG). After error correction and clustering of k-mers, SPAdes uses those k-mers to build several de Bruijn graphs. We discuss later in this section how many graphs it builds and why it uses more than one. Within those graphs, it searches for connected components, and within those components, it starts the simplification process. Simplification is based on correcting common errors (see Figure 8) in the connected components; such errors are consequences of errors in reads. We provide a summary of those errors, taken from SPAdes: indels, miscalled bases in the middle of reads, and small variations between repeats may lead to bulges; errors close to the ends of reads lead to tips; chimeric reads lead to chimeric h-paths (connections in the graph); and low-quality reads that do not map to the genome generate short, low-coverage, isolated h-paths.
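To make the notions of de Bruijn graph and tip concrete, the following sketch builds a graph whose vertices are (k-1)-mers and whose edges are the distinct k-mers of the reads, and then reports short dead-end paths that leave a branching vertex. This is a simplified illustration under our own assumptions: SPAdes works on condensed h-paths and also takes coverage into account, while this sketch uses only a length threshold. The names `de_bruijn`, `find_tips`, and `max_len` are hypothetical.

```python
from collections import Counter


def de_bruijn(reads, k):
    """Build a de Bruijn graph: vertices are (k-1)-mers, each distinct k-mer is an edge."""
    edges = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.setdefault(kmer[:-1], set()).add(kmer[1:])
    return edges


def find_tips(edges, max_len):
    """Return dead-end paths of at most max_len edges (max_len >= 1) that leave a branching vertex."""
    outdeg = Counter({src: len(dsts) for src, dsts in edges.items()})
    indeg = Counter(dst for dsts in edges.values() for dst in dsts)
    tips = []
    for src, dsts in edges.items():
        if outdeg[src] < 2:
            continue  # a tip needs an alternative path out of the same vertex
        for dst in sorted(dsts):
            path = [src, dst]
            # Follow the path while it is unbranched and still short enough.
            while (indeg[path[-1]] == 1 and outdeg[path[-1]] == 1
                   and len(path) - 1 < max_len):
                path.append(next(iter(edges[path[-1]])))
            if indeg[path[-1]] == 1 and outdeg[path[-1]] == 0:
                tips.append(path)
    return tips


# Two error-free reads and one read whose last base is miscalled (T -> A).
reads = ["ATGGCGTGCA", "ATGGCGTGCA", "ATGGCGA"]
graph = de_bruijn(reads, k=4)
print(find_tips(graph, max_len=2))  # [['GCG', 'CGA']]
```

The erroneous last base creates the extra edge GCG to CGA, which ends immediately, while the correct continuation of the genome is longer than the threshold and is therefore kept. A length threshold alone would also flag the true end of a short sequence, which is one reason real assemblers combine it with other evidence.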

It is worth noting that such errors are harder to correct in single-cell libraries, since typical contig assemblers rely on coverage and coverage is not uniform in these datasets. The several DBGs mentioned above are called multi-sized de Bruijn graphs (PENG et al., 2010). To understand why SPAdes uses multi-sized de Bruijn graphs, consider a hypothetical genome:

Given a genome of length L, we create three de Bruijn graphs based on different k-mer sizes: a small k-mer (which we will call s-mer), a medium k-mer (which we will call m-mer), and a large k-mer (which we will call l-mer). Components within the s-mer DBG are highly connected, but the complexity of their node connections is also high. Components within the l-mer DBG are simpler; however, gaps are introduced. The properties of the m-mer DBG components lie somewhere between those of the s-mer and l-mer DBGs.

Therefore, SPAdes uses multi-sized DBGs (varying the size of its k-mers; by default 21, 33, and 55) to get the best of all DBGs.
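The trade-off between small and large k-mers can be observed on a toy example. The sketch below builds one de Bruijn graph per value of k and counts branching vertices (a proxy for tangled connections) and connected components (a proxy for gaps). It only compares the graphs; it does not combine them as SPAdes does. The genome, the three reads, the values of k, and the function names are hypothetical and were chosen so that the effect is visible at a very small scale.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Set of (prefix, suffix) edges, one per distinct k-mer in the reads."""
    return {(r[i:i + k - 1], r[i + 1:i + k])
            for r in reads for i in range(len(r) - k + 1)}


def graph_stats(reads, k):
    """Return (number of branching vertices, number of connected components)."""
    succ, pred, adj = defaultdict(set), defaultdict(set), defaultdict(set)
    for u, v in de_bruijn_edges(reads, k):
        succ[u].add(v)
        pred[v].add(u)
        adj[u].add(v)
        adj[v].add(u)
    nodes = set(adj)
    branching = sum(1 for n in nodes
                    if len(succ.get(n, ())) > 1 or len(pred.get(n, ())) > 1)

    # Count connected components, ignoring edge direction.
    seen, components = set(), 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(adj[node] - seen)
    return branching, components


# Toy genome AAGCTTGCTACGATCC (GCT occurs twice), covered by three reads
# that overlap their neighbours by only four bases.
reads = ["AAGCTTGC", "TTGCTACG", "TACGATCC"]
for k in (3, 5, 7):
    print(k, graph_stats(reads, k))
# 3 (2, 1)  small k: one component, but the repeat creates branching vertices
# 5 (0, 1)  medium k: a single unbranched path
# 7 (0, 3)  large k: no branching, but the graph breaks into three pieces
```

With k = 3 the repeated GCT collapses into shared vertices; with k = 7 the four-base overlaps between reads are too short to share a 6-mer vertex, so gaps appear.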

Figure 8: DBG errors. From SPAdes: “Selected features within a de Bruijn graph. The red h-path, P, is under consideration for deletion or projection to another path (bulge corremoval). The blue path(s), Q, are alternative paths. (A) A potential bulge. Q may contain hubs within it, though P does not. (B) A potential tip; h-path P starts or ends at a vertex of total degree 1, and there is an alternative h-path Q. (C) A potential chimeric h-path. There must be alternative h-paths Q1, Q2 both for the entrance and the exit to P. (D) h-path is a repeat. Note that P starts with a vertex of outdegree one and ends with a vertex of in-degree one and has no alternative h-path. These degree conditions differentiate it from (A, B, C).”

Source: (BANKEVICH et al., 2012)

SPAdes: Repeat resolution

Repeats make up a significant percentage of genomes, and they are hard for assemblers to resolve. They generate paths with multiple entrances and multiple exits (Figure 9). Although those paths are generally relatively easy to resolve, they sometimes become tangles, which are computationally harder to resolve. To address this problem, SPAdes uses paired-end reads to create paired de Bruijn graphs. In the previous step, SPAdes created DBGs from k-mers, assigning each k-mer to a vertex of the graph. To create the paired de Bruijn graph, however, SPAdes assigns pairs of k-mers to each vertex and uses the distance between them to define edges (Figure 10). Then it identifies all identical pairs of k-mers. The difficulty is how to estimate distances between pairs of k-mers, because the insert size is not constant. SPAdes estimates the distances by clustering the distribution of observed insert sizes in the paired de Bruijn graph; the path is then estimated using the information obtained from the distribution of these insert-size clusters.
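The benefit of pairing k-mers can be sketched with the textbook paired de Bruijn graph, under a strong simplifying assumption: the distance between the two k-mers of every pair is known exactly. This is precisely what does not hold for real libraries and what SPAdes addresses by estimating distances, a step the sketch does not model. The gap parameter `d`, the toy genome, and the function names are hypothetical.

```python
def plain_edges(genome, k):
    """Ordinary de Bruijn edges between (k-1)-mers, one per distinct k-mer."""
    return {(genome[i:i + k - 1], genome[i + 1:i + k])
            for i in range(len(genome) - k + 1)}


def paired_kmers(genome, k, d):
    """All pairs of k-mers separated by a gap of exactly d bases."""
    span = 2 * k + d
    return [(genome[i:i + k], genome[i + k + d:i + span])
            for i in range(len(genome) - span + 1)]


def paired_edges(pairs):
    """Paired de Bruijn edges: vertices are pairs of (k-1)-mers."""
    return {((a[:-1], b[:-1]), (a[1:], b[1:])) for a, b in pairs}


def branching_nodes(edges):
    """Vertices with more than one incoming or more than one outgoing edge."""
    succ, pred = {}, {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
        pred.setdefault(v, set()).add(u)
    return {n for n in set(succ) | set(pred)
            if len(succ.get(n, ())) > 1 or len(pred.get(n, ())) > 1}


genome = "AAGCTTGCTACGATCC"  # GCT occurs twice
print(sorted(branching_nodes(plain_edges(genome, 3))))          # ['CT', 'GC']
print(branching_nodes(paired_edges(paired_kmers(genome, 3, 2))))  # set()
```

In the ordinary graph both copies of the repeat share the vertices GC and CT, so the path branches. In the paired graph each copy is accompanied by a different partner k-mer, the vertices stay distinct, and the genome is spelled by a single unbranched path.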