
An index is a data structure built over a dataset to support fast queries.

Generally, an index should take far less space than the original dataset to be practical. For the purposes of using an index to speed up finding alignments, we build the index over the Rmaps. The queries we are interested in are finding related or similar Rmaps from the dataset.

An inverted index [49] is a concept used in search engines: it is used to find the documents in which a given string occurs. This is essentially what we want from our index. A simple index for this purpose is a k-mer index, which associates each k-mer in the dataset with a list of its occurrences.

Similarity or relatedness can be measured by the number of shared occurring k-mers between two strings.
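As a concrete illustration of this idea (a minimal sketch, not the implementation used in the thesis), a k-mer index over Rmaps represented as lists of integer fragment lengths can be built with a hash map, and similarity estimated by counting shared k-mers:

```python
from collections import defaultdict

def build_kmer_index(rmaps, k):
    """Map each k-mer (a tuple of k consecutive fragment lengths)
    to the list of Rmap indices in which it occurs."""
    index = defaultdict(list)
    for rmap_id, fragments in enumerate(rmaps):
        for i in range(len(fragments) - k + 1):
            kmer = tuple(fragments[i:i + k])
            # Occurrence list; Rmap ids arrive in increasing order,
            # so checking the last entry deduplicates.
            if not index[kmer] or index[kmer][-1] != rmap_id:
                index[kmer].append(rmap_id)
    return index

def shared_kmers(index, a, b):
    """Count distinct k-mers occurring in both Rmaps a and b."""
    return sum(1 for occ in index.values() if a in occ and b in occ)

# Toy data: three short Rmaps given as fragment-length lists.
rmaps = [[2, 10, 3, 1, 5], [2, 10, 3, 7], [9, 9, 9]]
index = build_kmer_index(rmaps, k=2)
print(shared_kmers(index, 0, 1))  # Rmaps 0 and 1 share (2, 10) and (10, 3)
```

The quadratic scan in `shared_kmers` is for clarity only; in practice one would intersect the occurrence lists of the query Rmap's own k-mers.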

Indexing inherently erroneous data is problematic. Indexing can be seen as an effort to summarize only the important, informative parts of the data; no insight can be gained by analyzing errors.

For optical maps, indexing is a relatively newer area but still a requirement for efficient analysis. Compared to indexing nucleotide-based sequences, errors in optical maps misrepresent the information entirely. A missing or added cut site in an optical map is not simply an added or missing character in the data; rather, it changes the fragment lengths around the error.

For example, TWIN [45] uses a modified FM-index to help find alignments between Rmaps but is intolerant to errors, which makes it unable to find some alignments [50]. Thus, more suitable indexing approaches are a necessity for optical mapping data.

5.3.1 Spaced seeds

Building an index over the k-mers of the Rmaps has the inherent problem of failing to capture enough local information about the Rmaps: fragment lengths depend on each other far more than characters in strings generally do.

(l, k)-mers instead are k-mers that are extended beyond the strict k-element limit, such that the sum of the elements is at least l and the length is at least k.

This way more local information is contained in each atomic unit.
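Under this definition, extraction can be sketched as follows (a minimal sketch; the parameter values l = 16 and k = 3 are chosen for illustration so that the example from Figure 5.2 is reproduced):

```python
def lk_mers(fragments, l, k):
    """Extract (l, k)-mers: from each start position, extend the window
    until it has at least k fragments and their sum is at least l."""
    result = []
    for i in range(len(fragments)):
        j, total = i, 0
        while j < len(fragments) and (j - i < k or total < l):
            total += fragments[j]
            j += 1
        if j - i >= k and total >= l:  # drop windows cut short by the end
            result.append(tuple(fragments[i:j]))
    return result

print(lk_mers([2, 10, 3, 1, 5], l=16, k=3))
# → [(2, 10, 3, 1), (10, 3, 1, 5)]
```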

A relatively common technique in nucleotide-based pattern matching is to use what are called spaced seeds [51] to attempt to get around errors. A spaced seed is a binary pattern that tells when to skip a character. Assuming a uniform probability of errors along a string, skipping a character reduces the likelihood of including an error in the seed.
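The idea can be illustrated on strings (a toy sketch; the reads, the pattern, and the error position are invented for illustration):

```python
def apply_spaced_seed(s, pattern):
    """Keep characters where the binary pattern has a '1', skip where
    it has a '0'. Pattern and window have the same length."""
    assert len(s) == len(pattern)
    return "".join(c for c, p in zip(s, pattern) if p == "1")

# Two reads of the same region, differing by one substitution error:
a, b = "ACGTACGT", "ACGAACGT"
pattern = "11101111"  # skips position 3, where the error happens to lie
print(apply_spaced_seed(a, pattern) == apply_spaced_seed(b, pattern))  # True
```

With the plain 8-mers the two reads would not match; the spaced seed tolerates the error because the differing position is skipped.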

This introduces the problem of designing a spacing pattern that most likely skips only errors. In the general case this is likely an impossible problem to solve, as knowing the pattern of errors would make error correction trivial. However, some patterns do work better than others [42, 51]. Finding a good default seed can be done in brute-force fashion by iterating over the set of possible seeds and choosing some local minimum.

[Figure: an Rmap with fragment lengths 2, 10, 3, 1, 5; its k-mer [2, 10, 3], its (l, k)-mer [2, 10, 3, 1], and its spaced (l, k)-mer [2, 13, 6], in which skipped fragment lengths are merged into neighboring fragments.]

Figure 5.2: Examples of k-mer, (l, k)-mer, and spaced (l, k)-mer.

Spaced (l, k)-mers similarly use a pattern to skip fragment lengths. To keep the Rmap representative of the original genome, the lengths of the skipped fragments need to be added back to the next kept fragment length. An illustrative example of a spaced (l, k)-mer and its differences to both k-mers and (l, k)-mers is given in Figure 5.2.
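A minimal sketch of this transformation, assuming skipped lengths are folded into the next kept fragment; the pattern "10101" is chosen so that the spaced (l, k)-mer [2, 13, 6] from Figure 5.2 is reproduced:

```python
def spaced_fragments(fragments, pattern):
    """Apply a binary pattern to fragment lengths: where the pattern is
    '0', the fragment is skipped and its length is carried over into the
    next kept fragment, so the total length stays representative."""
    assert len(fragments) == len(pattern)
    result, carry = [], 0
    for length, p in zip(fragments, pattern):
        if p == "1":
            result.append(length + carry)
            carry = 0
        else:
            carry += length
    if carry and result:  # trailing skipped lengths fold into the last kept fragment
        result[-1] += carry
    return result

print(spaced_fragments([2, 10, 3, 1, 5], "10101"))  # → [2, 13, 6]
```

Note that the sum of the output equals the sum of the input, which is exactly the property that keeps the spaced (l, k)-mer representative of the genome.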

Small errors in the Rmaps, i.e. in the range of off-by-ones, may make it hard to find truly related Rmaps using the index. To further increase the error tolerance of the index, the spaced (l, k)-mers are also merged based on distance. Here the distance between two (l, k)-mers A and B is defined as

D(A, B) = Σ_i |A_i − B_i|.

Then similar (l, k)-mers can be merged up to some threshold t using a simple union-find structure. However, special care needs to be taken to keep each set of merged (l, k)-mers bounded in size: as a set of merged (l, k)-mers grows, the number of (l, k)-mers similar to at least one other (l, k)-mer in the set also grows. A simple heuristic can be used to artificially limit the size of the merge sets.
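The merging step can be sketched as follows. This is an illustrative implementation under stated assumptions: the quadratic all-pairs loop, the toy (l, k)-mers, and the values t = 2 and max_size = 4 are chosen for demonstration, and the size cap stands in for whatever heuristic the real implementation uses; a practical version would avoid comparing all pairs.

```python
def distance(a, b):
    """D(A, B) = sum_i |A_i - B_i| between two equal-length (l, k)-mers."""
    return sum(abs(x - y) for x, y in zip(a, b))

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y, max_size):
        rx, ry = self.find(x), self.find(y)
        if rx == ry or self.size[rx] + self.size[ry] > max_size:
            return  # heuristic: refuse merges that would grow a set too large
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]

# Merge (l, k)-mers whose pairwise distance is at most t.
lkmers = [(2, 10, 3, 1), (2, 11, 3, 1), (9, 9, 9, 9)]
uf = UnionFind(len(lkmers))
t, max_size = 2, 4
for i in range(len(lkmers)):
    for j in range(i + 1, len(lkmers)):
        if len(lkmers[i]) == len(lkmers[j]) and distance(lkmers[i], lkmers[j]) <= t:
            uf.union(i, j, max_size)
print(uf.find(0) == uf.find(1), uf.find(0) == uf.find(2))  # True False
```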

In order to compress the index, a key observation needs to be made: the Elmeri index is composed of three distinct parts, (a) the spaced (l, k)-mers, (b) the (l, k)-mer merges, and (c) the occurrence lists. Each of these parts requires a different compression strategy, as they inherently have very different characteristics.

The spaced (l, k)-mers are stored in the index to support the similarity queries. Given a spaced (l, k)-mer, the index is first used to find the corresponding part of the index. This is wasteful in both the space it takes to store every (l, k)-mer and the time it takes to scan the index for the correct (l, k)-mer. A common technique to avoid this bottleneck is to use a hash map, which replaces storing explicit (l, k)-mers with the values of a hash function. We can further improve the space usage by using a minimal perfect hash function (MPHF) [52, 53], as we know the complete set of (l, k)-mers we want the index to cover.

The sets of (l, k)-mer merges are implicitly stored in the Elmeri index as pointers to the occurrence list of the merge root. This is already close to the smallest possible representation, as we need to be able to find each merged (l, k)-mer and the merged occurrence lists in the index. We use a more explicit strategy that is marginally smaller: we store a list that records, for each (l, k)-mer, the index of its merge root's occurrence list, and use it to find the corresponding occurrence list for any (l, k)-mer in the index.

The occurrence lists pose the most classical of the three compression problems. Each list contains integers marking the indices of the Rmaps in which the (l, k)-mer occurs. Since storing each integer takes a fixed number of bits regardless of its value, this takes up a lot of space. The textbook approach is to use an encoding with a variable number of bits per value, so that smaller integers ideally use less space, and to further compress the lists by storing only the differences between consecutive occurrences. We use StreamVByte [54] encoding, which is fast and space-efficient.
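The delta-plus-variable-length idea can be sketched with a classic byte-oriented varint (a simplification for illustration: StreamVByte itself stores 2-bit length codes in separate control bytes for SIMD-friendly decoding, which this sketch does not reproduce):

```python
def encode_occurrences(occurrences):
    """Delta-encode a sorted occurrence list, then emit each gap as a
    variable-length integer: 7 data bits per byte, high bit = continue."""
    out, prev = bytearray(), 0
    for occ in occurrences:
        gap = occ - prev
        prev = occ
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)

def decode_occurrences(data):
    occs, value, shift, prev = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7       # continuation: more bits follow
        else:
            prev += value    # undo the delta encoding
            occs.append(prev)
            value, shift = 0, 0
    return occs

encoded = encode_occurrences([3, 7, 300, 100000])
print(len(encoded), decode_occurrences(encoded))  # 7 [3, 7, 300, 100000]
```

Here four occurrences fit in 7 bytes instead of the 16 bytes needed with fixed 32-bit integers; the small gaps compress well even though the last occurrence is large.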

Constructing the index can also be done space-efficiently, as only parts of the uncompressed index ever need to be in memory at once. This is essentially the same observation we already made with regard to compression. First, the spaced (l, k)-mers can be computed in parts and finally merged, keeping memory usage at a minimum. The hash function can be constructed space-efficiently using existing tools. Lastly, the occurrence lists can be compressed in parts, as they do not depend on each other.

5.3.2 Results

The main results from Paper III are summarized here. The method is implemented as a tool called Selkie. Table 5.1 shows how the construction time and space usage are improved over the Elmeri index. The indexes constructed by Elmeri and Selkie are mostly similar in content; they can answer exactly the same queries. For the purposes of this comparison, Elmeri was modified to not store or compute the positions of occurrences.

Even so, the compression scheme used in Selkie easily beats Elmeri in space usage, using only a quarter of the memory to construct the index.

Compared to Elmeri, the construction is also faster due to an improved implementation, e.g. the spaced (l, k)-mers are represented as actual numbers rather than strings. This leads to roughly halving the construction time, even while also having to compress all parts of the index.

Dataset  Method  Runtime      Peak memory
Ecoli1   Elmeri  1 min 39 s   210.71 MB
Ecoli1   Selkie  23 s         1.16 MB
Ecoli2   Elmeri  39 min 5 s   8.83 GB
Ecoli2   Selkie  36 min 40 s  2.10 GB
Human    Elmeri  8 h 59 min   84.05 GB
Human    Selkie  3 h 50 min   21.04 GB

Table 5.1: Runtime and peak memory usage of index construction for Selkie and Elmeri on three datasets, two E. coli and one human.

Method          Runtime  Peak memory  Precision  Recall
Valouev et al.  2515 s   4.08 MB      0.844      0.021
Selkie          113 s    4.20 MB      0.849      0.021
MalignerDP      1747 s   14.35 MB     0.562      0.111
OMBlast         30 s     271.6 MB     0.892      0.0003

Table 5.2: Runtime, peak memory usage, precision, and recall of the different tools on E. coli.

Comparisons to other overlap computation methods are shown in Table 5.2. In order to compare the correctness of the overlap computation, we report precision and recall for each method. Precision is the fraction of true positives among all reported overlaps (true and false positives), i.e. how precise the method is at correctly reporting overlaps, while recall is the fraction of true positives among all correct overlaps (true positives and false negatives), i.e. how well the method is able to find the correct overlaps in the data.
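These two definitions can be stated directly in code (a small sketch; the overlap sets below are invented toy data, with overlaps represented as pairs of Rmap ids):

```python
def precision_recall(reported, truth):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN),
    with overlaps given as sets of Rmap-id pairs."""
    tp = len(reported & truth)                      # true positives
    precision = tp / len(reported) if reported else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

truth = {(1, 2), (2, 3), (3, 4), (4, 5)}
reported = {(1, 2), (2, 3), (7, 9)}  # two correct overlaps, one false positive
print(precision_recall(reported, truth))
```

A method that reports very few overlaps can thus have high precision yet very low recall, which is exactly the trade-off visible between OMBlast and MalignerDP in Table 5.2.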

Using these metrics, we can tell that pre-filtering the dataset of Rmaps for potential overlaps with Selkie retains all overlaps found by the Valouev et al. [44] method, which is the best available method for accurately computing overlaps between Rmaps. This accuracy comes at the cost of by far the longest runtime, which is reduced immensely when combined with the Selkie filtering scheme.

Chapter 6

Discussion

In this thesis, we have looked at the minimum required steps for genome and optical map assembly. Handling errors in data is a necessity when analyzing data that describes essentially unknown phenomena in biology.

Errors can be either corrected, as described in Chapter 3, or they can be tolerated with smart processing as described in Chapter 5.

We discussed the difficulty of genome assembly and the opportunities for using additional data to guide the assembly process. Incorporating additional data as constraints in assembly graph construction allows for large improvements in both contiguity and length in genome assembly while also being a drop-in replacement in assembly pipelines.

There are some interesting developments that could affect some basic assumptions we had to make at the beginning. Recently, wavefront alignment [55] has emerged as an efficient method for computing alignments between strings. With a fast enough alignment engine, assembly graph construction could differ from how we defined it. However, it remains to be seen whether this direction yields more than marginal improvements to existing pipelines.

In Paper I, we were limited by the technology of the time and only looked at correcting the reads that were then state-of-the-art. However, more recent sequencing technologies [27] have made error rates much smaller, effectively producing long reads that are as accurate as short reads.

Not only does this render the error correction tools of the time largely obsolete, it completely changes how error correction should be approached.

In Chapter 1, we also briefly looked at using the error correction methodology to polish assemblies rather than reads. Work in this direction is out of the scope of this thesis.

While HGGA from Paper IV greatly improves the assembly results over Kermit in Paper II, the central contribution of its redefinition of the guided genome assembly problem is its ability to support non-linear guide data. No results on how well this works exist, as there is currently no implementation and, crucially, no existing method to efficiently turn arbitrary non-linear data into a hierarchy.

Assembly of optical maps is a much less developed area than genome assembly. The work we presented in Paper III on more efficient overlap computation for optical maps could be leveraged to produce assembly graphs of optical maps and therefore potentially build an efficient assembly pipeline. However, as such work was out of the scope of the paper, we do not discuss it further here.

References

[1] Yu Gyoung Tak and Peggy J. Farnham. "Making sense of GWAS: using epigenomics and genome engineering to understand the functional relevance of SNPs in non-coding regions of the human genome". Epigenetics & Chromatin, 8(1):1–18, 2015. doi:10.1186/s13072-015-0050-4.

[2] Danielle Welter et al. "The NHGRI GWAS Catalog, a curated resource of SNP-trait associations". Nucleic Acids Research, 42(D1):D1001–D1006, 2013. doi:10.1093/nar/gkt1229.

[3] Eric S. Lander et al. "Initial sequencing and analysis of the human genome". Nature, 409(6822):860–921, 2001. doi:10.1038/35057062.

[4] Harry Stack Sullivan. "Problems of personality; studies presented to Dr. Morton Prince, pioneer in American psychopathology." American Journal of Psychiatry, 83(3):605–607, 1927. doi:10.1176/ajp.83.3.605.

[5] Julian Catchen, Angel Amores, and Susan Bassham. "Chromonomer: a tool set for repairing and enhancing assembled genomes through integration of genetic maps and conserved synteny". G3 Genes Genomes Genetics, 10(11):4115–4128, 2020. doi:10.1534/g3.120.401485.

[6] Trevor Paterson and Andy Law. "Arkmap: integrating genomic maps across species and data sources". BMC Bioinformatics, 14(1):1–10, 2013. doi:10.1186/1471-2105-14-246.

[7] Leena Salmela, Riku Walve, Eric Rivals, and Esko Ukkonen. "Accurate self-correction of errors in long reads using de Bruijn graphs". Bioinformatics, 33(6):799–806, 2017.

[8] Riku Walve, Pasi Rastas, and Leena Salmela. "Kermit: linkage map guided long read assembly". Algorithms for Molecular Biology, 14(1):1–10, 2019.

[9] Riku Walve, Simon J. Puglisi, and Leena Salmela. "Space-efficient indexing of spaced seeds for accurate overlap computation of raw optical mapping data". IEEE/ACM Transactions on Computational Biology and Bioinformatics, 19(4):2454–2462, 2022. doi:10.1109/TCBB.2021.3085086.

[10] Riku Walve and Leena Salmela. "HGGA: hierarchical guided genome assembler". BMC Bioinformatics, 23(1):1–17, 2022. doi:10.1186/s12859-022-04701-2.

[11] Arturs Backurs and Piotr Indyk. "Edit distance cannot be computed in strongly subquadratic time (unless SETH is false)". In Proceedings of the 47th Annual ACM Symposium on Theory of Computing, pages 51–58, 2015. doi:10.1145/2746539.2746612.

[12] David J. Bacon and Wayne F. Anderson. "Multiple sequence alignment". Journal of Molecular Biology, 191(2):153–161, 1986. doi:10.1016/0022-2836(86)90252-4.

[13] Lusheng Wang and Tao Jiang. "On the complexity of multiple sequence alignment." Journal of Computational Biology, 1(4):337–348, 1994. doi:10.1089/cmb.1994.1.337.

[14] David R. Kelley, Michael C. Schatz, and Steven L. Salzberg. "Quake: quality-aware detection and correction of sequencing errors". Genome Biology, 11, 2010. doi:10.1186/gb-2010-11-11-r116.

[15] Paul Medvedev, Eric Scott, Boyko Kakaradov, and Pavel Pevzner. "Error correction of high-throughput sequencing datasets with non-uniform coverage". Bioinformatics, 27(13):i137–i141, 2011. doi:10.1093/bioinformatics/btr208.

[16] Xiao Yang, Karin S. Dorman, and Srinivas Aluru. "Reptile: representative tiling for short read error correction". Bioinformatics, 26(20):2526–2533, 2010. doi:10.1093/bioinformatics/btq468.

[17] Giles Miclotte, Mahdi Heydari, Piet Demeester, Stephane Rombauts, Yves Van de Peer, Pieter Audenaert, and Jan Fostier. "Jabba: hybrid error correction for long sequencing reads". Algorithms for Molecular Biology, 11(1):1–12, 2016.

[18] Lucian Ilie, Farideh Fazayeli, and Silvana Ilie. "HiTEC: accurate error correction in high-throughput sequencing data". Bioinformatics, 27(3):295–302, 2010. doi:10.1093/bioinformatics/btq653.

[19] Leena Salmela. "Correction of sequencing errors in a mixed set of reads". Bioinformatics, 26(10):1284–1290, 2010. doi:10.1093/bioinformatics/btq151.

[20] Jan Schröder, Heiko Schröder, Simon J. Puglisi, Ranjan Sinha, and Bertil Schmidt. "SHREC: a short-read error correction method". Bioinformatics, 25(17):2157–2163, 2009. doi:10.1093/bioinformatics/btp379.

[21] Christina Boucher, Alex Bowe, Travis Gagie, Simon J. Puglisi, and Kunihiko Sadakane. "Variable-order de Bruijn graphs". In Proceedings of the 2015 Data Compression Conference, pages 383–392, 2015.

[22] Pierre Morisse, Thierry Lecroq, and Arnaud Lefebvre. "Hybrid correction of highly noisy long reads using a variable-order de Bruijn graph". Bioinformatics, 34(24):4213–4222, 2018. doi:10.1093/bioinformatics/bty521.

[23] Leena Salmela and Jan Schröder. "Correcting errors in short reads by multiple alignments". Bioinformatics, 27(11):1455–1461, 2011. doi:10.1093/bioinformatics/btr170.

[24] Robert Vaser, Ivan Sović, Niranjan Nagarajan, and Mile Šikić. "Fast and accurate de novo genome assembly from long uncorrected reads". Genome Research, 27(5):737–746, 2017. doi:10.1101/gr.214270.116.

[25] Leena Salmela and Eric Rivals. "LoRDEC: accurate and efficient long read error correction". Bioinformatics, 30(24):3506–3514, 2014. doi:10.1093/bioinformatics/btu538.

[26] Pierre Morisse, Thierry Lecroq, and Arnaud Lefebvre. "Long-read error correction: a survey and qualitative comparison". bioRxiv, 2021. doi:10.1101/2020.03.06.977975.

[27] Aaron M. Wenger et al. "Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome". Nature Biotechnology, 37(10):1155–1162, 2019. doi:10.1038/s41587-019-0217-9.

[28] Ergude Bao, Tao Jiang, and Thomas Girke. "AlignGraph: algorithm for secondary de novo genome assembly guided by closely related references". Bioinformatics, 30(12):i319–i328, 2014.

[29] Heidi E. L. Lischer and Kentaro K. Shimizu. "Reference-guided de novo assembly approach improves genome reconstruction for related species". BMC Bioinformatics, 18(1):1–12, 2017.

[30] Korbinian Schneeberger et al. "Reference-guided assembly of four diverse Arabidopsis thaliana genomes". Proceedings of the National Academy of Sciences, 108(25):10249–10254, 2011.

[31] Chen-Shan Chin et al. "Phased diploid genome assembly with single-molecule real-time sequencing". Nature Methods, 13(12):1050–1054, 2016.

[32] Mikhail Kolmogorov, Jeffrey Yuan, Yu Lin, and Pavel A. Pevzner. "Assembly of long, error-prone reads using repeat graphs". Nature Biotechnology, 37(5):540–546, 2019.

[33] Sergey Koren, Brian P. Walenz, Konstantin Berlin, Jason R. Miller, Nicholas H. Bergman, and Adam M. Phillippy. "Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation". Genome Research, 27(5):722–736, 2017. doi:10.1101/gr.215087.116.

[34] Heng Li. "Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences". Bioinformatics, 32(14):2103–2110, 2016. doi:10.1093/bioinformatics/btw152.

[35] Gary Chartrand, Garry L. Johns, Kathleen A. McKeon, and Ping Zhang. "Rainbow connection in graphs". Mathematica Bohemica, 133(1):85–98, 2008.

[36] Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi, and Glenn Tesler. "QUAST: quality assessment tool for genome assemblies". Bioinformatics, 29(8):1072–1075, 2013.

[37] Mosè Manni, Matthew R. Berkeley, Mathieu Seppey, Felipe A. Simão, and Evgeny M. Zdobnov. "BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes". Molecular Biology and Evolution, 38(10):4647–4654, 2021.

[38] Steven L. Salzberg et al. "GAGE: a critical evaluation of genome assemblies and assembly algorithms." Genome Research, 22:557–567, 2012. doi:10.1101/gr.131383.111.

[39] Dent Earl et al. "Assemblathon 1: a competitive assessment of de novo short read assembly methods." Genome Research, 21:2224–2241, 2011. doi:10.1101/gr.126599.111.

[40] Veli Mäkinen, Leena Salmela, and Johannes Ylinen. "Normalized N50 assembly metric using gap-restricted co-linear chaining." BMC Bioinformatics, 13:1–5, 2012. doi:10.1186/1471-2105-13-255.

[41] Miika Leinonen and Leena Salmela. "Optical map guided genome assembly". BMC Bioinformatics, 21(1):1–19, 2020.

[42] Leena Salmela, Kingshuk Mukherjee, Simon J. Puglisi, Martin D. Muggli, and Christina Boucher. "Fast and accurate correction of optical mapping data via spaced seeds". Bioinformatics, 36(3):682–689, 2019. doi:10.1093/bioinformatics/btz663.

[43] Kingshuk Mukherjee, Darshan Washimkar, Martin D. Muggli, Leena Salmela, and Christina Boucher. "Error correcting optical mapping data". GigaScience, 7(6), 2018. doi:10.1093/gigascience/giy061.

[44] Anton Valouev, Lei Li, Yu-Chi Liu, David C. Schwartz, Yi Yang, Yu Zhang, and Michael S. Waterman. "Alignment of optical maps". Journal of Computational Biology, 13(2):442–462, 2006. doi:10.1007/11415770_37.

[45] Martin D. Muggli, Simon J. Puglisi, and Christina Boucher. "Efficient indexed alignment of contigs to optical maps". In Proceedings of the 14th International Workshop on Algorithms in Bioinformatics, pages 68–81, 2014. doi:10.1007/978-3-662-44753-6_6.

[46] Martin D. Muggli, Simon J. Puglisi, and Christina Boucher. "Kohdista: an efficient method to index and query possible Rmap alignments". Algorithms for Molecular Biology, 14(1):1–13, 2019. doi:10.1186/s13015-019-0160-9.

[47] Shiguo Zhou et al. "A clone-free, single molecule map of the domestic cow (Bos taurus) genome". BMC Genomics, 16(1):1–19, 2015.

[48] Paolo Ferragina and Giovanni Manzini. "Opportunistic data structures with applications". In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pages 390–398, 2000.

[49] Justin Zobel and Alistair Moffat. "Inverted files for text search engines". ACM Computing Surveys, 38(2), 2006. doi:10.1145/1132956.1132959.

[50] Alden King-Yung Leung, Tsz-Piu Kwok, Raymond Wan, Ming Xiao, Pui-Yan Kwok, Kevin Y. Yip, and Ting-Fung Chan. "OMBlast: alignment tool for optical mapping using a seed-and-extend approach". Bioinformatics, 33(3):311–319, 2017. doi:10.1093/bioinformatics/btw620.

[51] Stefan Burkhardt and Juha Kärkkäinen. "Better filtering with gapped q-grams". Fundamenta Informaticae, 56(1-2):51–70, 2003.

[52] Djamal Belazzougui, Fabiano C. Botelho, and Martin Dietzfelbinger. "Hash, displace, and compress". In Proceedings of the 17th Annual European Symposium on Algorithms, pages 682–693, 2009. doi:10.1007/978-3-642-04128-0_61.
