The main results from Papers II and IV are summarized here. Table 4.1 shows how both methods improve the assembly over miniasm, which does not use genetic linkage maps.
The assemblies produced by miniasm, Kermit, and HGGA were all evaluated using QUAST [36] and BUSCO [37]. BUSCO maps a set of genes to the assembly and reports, among other things, the number of complete genes it can map, whereas QUAST provides an exhaustive list of statistics on the quality of assemblies.
The NGA50 statistic reflects decades' worth of compromises in the effort to find a good summary statistic for genome assemblies. The original N50 statistic is defined as the largest length L such that contigs of length at least L cover at least half of the assembly. After multiple proposed improvements [38, 39, 40], QUAST has settled on a combination of NG50 and NA50 called NGA50. NG50 (G for genome) uses the length of the genome rather than that of the assembly, and NA50 (A for aligned) first aligns the assembly to a reference and then splits each contig at every misassembly.
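The definitions above can be illustrated with a minimal sketch. The function name `nx50` and the toy contig lengths are assumptions made for this example and are not taken from QUAST. Passing the genome length as `total` gives NG50; applying the same computation to the aligned blocks obtained after splitting contigs at misassemblies would give NGA50.

```python
def nx50(lengths, total=None):
    """Largest length L such that pieces of length >= L cover half of `total`.

    With total=None the assembly size is used (N50); passing the genome
    length instead gives NG50. Returns 0 if the pieces cover less than half.
    """
    if total is None:
        total = sum(lengths)
    covered = 0
    # Walk through the pieces from longest to shortest.
    for length in sorted(lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    return 0


# Toy contig lengths (hypothetical values, for illustration only).
contigs = [80, 70, 50, 40, 30, 20]
print(nx50(contigs))             # N50 relative to the assembly size of 290
print(nx50(contigs, total=400))  # NG50 relative to an assumed genome size of 400
```

Because NG50 is measured against the genome rather than the assembly, an assembly that is missing large parts of the genome cannot obtain a high value by consisting of only a few long contigs.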
The two main contiguity statistics, the number of contigs and NGA50, show the improvement HGGA brings to assembly contiguity. While Kermit offers a more modest improvement over miniasm, HGGA raises NGA50 by up to nearly 10x (H. sapiens in Table 4.1).
Genome fraction and reads mapped give similar perspectives on assembly correctness. Genome fraction represents the fraction of the reference assembly found in the produced assemblies. Reads mapped is the percentage of reads that can be accurately mapped back to the assemblies. Both measure how much information is contained and preserved in the assembly.
There are only minor differences in these numbers, as both of our methods make heavy use of miniasm, which is very good at retaining the genome in the produced assembly.
Misassemblies are larger structural differences between a reference and the assembly, e.g. a section of the genome placed somewhere else because of an erroneous overlap. Generally, both methods improve upon the misassembly rate of miniasm. Kermit is easily the best of the three, since the graph cleaning should in theory remove any possibility of making a misassembly. HGGA, however, has a misassembly rate similar to that of miniasm, potentially due to relying on miniasm for leaf assemblies, being too aggressive at merging contigs, or some combination of both.
Method     # of      NGA50       Genome     Misassemblies  BUSCO      Reads    Runtime   Peak
           contigs   (bp)        fraction                  Complete   Mapped   (min)     memory
                                 (%)                                  (%)                (MB)
C. elegans
miniasm    126       1,982,361   99.443     10             98.1       99.75    20        18,332
Kermit     83        2,819,353   99.535     7              98.3       99.75    23        19,578
HGGA       31        5,901,436   99.595     14             97.2       99.78    51        1,334
A. thaliana
miniasm    712       2,552,623   98.766     346            84.5       96.63    2.37      34,128
Kermit     123       2,552,489   98.185     174            85.1       89.07    2.08      34,486
HGGA       136       4,173,314   98.247     242            86.3       95.87    3.41      10,050
H. sapiens
miniasm    8,789     692,902     89.761     3,669          76.5       61.37    237.84    565,309
Kermit     4,503     1,050,164   90.069     762            77.9       60.65    239.29    565,307
HGGA       2,204     6,814,538   93.181     3,004          86.5       70.45    37.46     69,492
Table 4.1: Comparison of HGGA, miniasm, and Kermit on data sets with real PacBio long reads and simulated genetic linkage maps. The best entries in each column are shown in bold.
Kermit and HGGA both rely heavily on minimap and miniasm, and thus all three share very similar constraints on time and space. Neither method adds any significant overhead, and notably HGGA gives a large reduction in memory usage due to the way it splits reads and limits the number of possible overlaps to compute.
Chapter 5
Optical Mapping
Of the various types of data that could be used for guided assembly, the closest relative to reference genomes and genetic linkage maps is the optical map. All three provide an inherently linear guide for assembly: reference genomes and linkage maps by way of an assembled sequence to align reads to, and optical maps by representing a full genome sequence as the lengths between occurrences of known substrings of the genome.
Optical maps are created by cutting DNA molecules into fragments at positions defined by a restriction enzyme. The enzyme recognizes a known DNA sequence and cuts the molecule where that sequence occurs. The lengths of the resulting fragments are then estimated optically and collected into Rmaps, or restriction maps. An example of a restriction map is given in Figure 5.1.
Similar to reads in nucleotide-based methods, we have Rmaps as the intermediate between the biological process and computational methods.
Rmaps have errors that need to be overcome and they need to be assembled to form a full genome-wide optical map. The errors Rmaps contain are either missing or added cut sites and potentially wrong fragment lengths.
Analysis of optical maps mirrors that of nucleotide-based genome analysis.
Optical maps represent the genome in a different, notably larger alphabet, using fragment lengths instead of nucleotides. Many ideas can be brought over to optical map analysis.
Broadly speaking, alignments between two Rmaps can be computed just as with any other strings. However, Rmaps have an entirely different error model and a distinctly different alphabet. For example, an added cut site splits a fragment in two, so it shortens a neighboring fragment by the length of the new one, rather than simply adding a character with some alignment score cost.
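A small sketch can make this error model concrete. The helper names `add_cut_site` and `remove_cut_site` and the example fragment lengths are hypothetical, and the sketch ignores errors in the measured fragment lengths themselves.

```python
def add_cut_site(fragments, index, offset):
    """Insert a spurious cut `offset` units into fragment `index`.

    The fragment is split in two; no length is added to the Rmap.
    """
    length = fragments[index]
    if not 0 < offset < length:
        raise ValueError("offset must fall strictly inside the fragment")
    return fragments[:index] + [offset, length - offset] + fragments[index + 1:]


def remove_cut_site(fragments, index):
    """Drop the cut between fragments `index` and `index + 1`, merging them."""
    merged = fragments[index] + fragments[index + 1]
    return fragments[:index] + [merged] + fragments[index + 2:]


rmap = [4, 14, 3]                  # example fragment lengths
print(add_cut_site(rmap, 1, 5))    # the fragment of length 14 becomes 5 and 9
print(remove_cut_site(rmap, 0))    # the fragments 4 and 14 merge into 18
```

In both cases the total length of the Rmap is unchanged, which is why an Rmap aligner has to compare sums of consecutive fragments rather than score single-character insertions and deletions.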
ACGT  ACGT        ACGT  ACGT
|  4  |    14     |  3  |
0     4           18    21
Figure 5.1: Example of a restriction map, where the genome (top) is cut into fragments at occurrences of ACGT. Cut sites (bottom) are equal to the cumulative sum of fragment lengths (middle).
Given a restriction sequence, we can simulate the cutting process in silico. We simply count the number of characters between occurrences of the restriction sequence. This way we can also align an Rmap to nucleotide sequences or vice versa, which is useful when using optical maps as guide data for assembly [41].
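The in silico cutting described above can be sketched as follows. The function names `digest` and `cut_sites` and the toy genome are assumptions for illustration; a real restriction enzyme cuts double-stranded DNA at a specific position within its recognition site, which this simplification ignores.

```python
from itertools import accumulate


def digest(genome, site):
    """Fragment lengths between consecutive occurrences of `site` in `genome`."""
    positions = []
    start = genome.find(site)
    while start != -1:
        positions.append(start)
        start = genome.find(site, start + 1)
    # Each fragment spans from one occurrence to the next.
    return [b - a for a, b in zip(positions, positions[1:])]


def cut_sites(fragments):
    """Cut-site positions as the cumulative sum of fragment lengths."""
    return [0] + list(accumulate(fragments))


# Toy genome (hypothetical): ACGT occurs at positions 0, 4, 18 and 24.
genome = "ACGT" + "ACGT" + "T" * 10 + "ACGT" + "GG" + "ACGT"
fragments = digest(genome, "ACGT")
print(fragments)             # fragment lengths
print(cut_sites(fragments))  # positions relative to the first occurrence
```

Positions are reported relative to the first occurrence of the restriction sequence, matching the convention in Figure 5.1 where the first cut site is at 0.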
Using these building blocks, we can reformulate essentially all of the techniques discussed in this thesis for the analysis of optical maps. First, we will look at error correction and assembly in the optical map alphabet. We will then discuss the problems and solutions of efficient indexing and how it is used to make analysis faster.
5.1 Error correction
Just as with read error correction, correcting Rmaps means approximating a multiple sequence alignment. While the underlying ideas are very similar to those of nucleotide-based analysis, a recurring theme emerges in optical maps: the area is not as widely researched, and only a few methods compete with each other, with widely varying trade-offs. There are two existing methods for error correcting Rmaps: Elmeri [42] and Comet [43]. Both follow similar ideas when it comes to the consensus.
Elmeri uses essentially the iterative MSA model from Section 3.2, constructing a consensus sequence by iteratively aligning related Rmaps to the consensus. Rather than directly aligning sequences using costly Rmap alignment, Elmeri transforms the Rmaps into binary sequences and uses a simpler, character-based alignment.
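To give an intuition for how an Rmap could be turned into a binary sequence, the sketch below marks which fixed-size bins along the molecule contain a cut site. The binning scheme, the function name `rmap_to_bits`, and the bin size are assumptions made purely for illustration. The text above does not specify the encoding, and this is not presented as Elmeri's actual procedure.

```python
from itertools import accumulate


def rmap_to_bits(fragments, bin_size):
    """Encode an Rmap as a 0/1 string: '1' marks a bin containing a cut site.

    Illustrative assumption only: bins have a fixed size, and two cut sites
    that fall into the same bin collapse into a single '1'.
    """
    sites = [0] + list(accumulate(fragments))
    bits = ["0"] * (sites[-1] // bin_size + 1)
    for position in sites:
        bits[position // bin_size] = "1"
    return "".join(bits)


print(rmap_to_bits([4, 14, 3], bin_size=2))
```

Once Rmaps are in such a form, ordinary character-based alignment can be applied to them, at the cost of losing resolution below the bin size.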
Comet finds all pairwise alignments of related Rmaps and constructs a multiple alignment grid, where it stores decompositions of the Rmaps and their alignments as pairs of match counts between Rmaps. This reduces the number of Rmap-to-consensus alignments compared to an iterative MSA at the cost of having a large grid to compute around.
Elmeri improves on Comet by using a fast indexing scheme to find related Rmaps and a cheaper MSA approximation. With only these two methods to compare, however, it is difficult to extrapolate to a competitive area of research, which reflects the broader difficulty of evaluating methods for optical map analysis.