Chapter 2 4Pipe4 – A 454 data analysis pipeline for SNP detection in datasets with no
4.3 Validation
To assess the efficiency of SNP detection and the rate of false positives, and to determine the best default values, we used a validation approach based on reference data.
For this purpose, the reference sequences of two E. coli strains were used (http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=ADWQ01 - Strain 85 and http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=ADWR01 - Strain 79). Two 454 datasets for the same strains were also downloaded from the NCBI Sequence Read Archive (SRA) (Leinonen, Sugawara, & Shumway, 2011) (http://www.ncbi.nlm.nih.gov/sra/SRX036805 and http://www.ncbi.nlm.nih.gov/sra/SRX036804).
To assess the number of SNPs between the strains that could be found in the 454 datasets, the 454 reads of one strain were mapped against the reference sequence of the other strain using bowtie2 (Langmead & Salzberg, 2012). Atlas-SNP2 (Shen et al., 2010) reported 29673 SNPs between the reference sequence '85' and the 454 reads of the strain '79', and 28525 SNPs between the reference sequence '79' and the 454 reads of the strain '85'.
Figure 2.1: 4Pipe4 flowchart. The rectangular shapes represent processes; the rhomboid shapes represent input/output files. The dashed arrows represent optional steps. The names inside square brackets are the names of the external programs used. The digits in the top right corner of each rectangle represent the step number of each process.
4Pipe4 was then run on the two merged 454 datasets, discarding all strain information.
Although this validation method is not as good as true wet-lab genotyping, it is likely to be a good proxy, since Atlas-SNP2 is known to have very high sensitivity and specificity when dealing with 454 datasets (Shen et al., 2010).
The results varied with the tested parameters (Table 2.1), but the best output was obtained with the default values: a minimum coverage of 15 and a minimum average quality of 70 per variant. This setup retrieved 114 SNPs, of which 32 did not match any of those detected by Atlas-SNP2 and were thus considered false positives (a 28.07% false discovery rate, referred to below as the false positive rate).
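The thresholding and matching steps described above can be illustrated with a minimal sketch. The record layout (`id`, `coverage`, `qualities`), the use of a plain arithmetic mean, the inclusive comparisons and the matching by identifier are all assumptions made for illustration; they are not taken from the 4Pipe4 source code, and the data are invented.

```python
from statistics import mean

# Default thresholds reported in the text.
MIN_COVERAGE = 15
MIN_AVG_QUALITY = 70


def passes_filters(snp, min_coverage=MIN_COVERAGE, min_avg_quality=MIN_AVG_QUALITY):
    """Return True if a candidate SNP meets both thresholds.

    Assumption: thresholds are inclusive and the average quality is the
    arithmetic mean of the quality values attached to the variant.
    """
    return (snp["coverage"] >= min_coverage
            and mean(snp["qualities"]) >= min_avg_quality)


def validate(candidates, reference_ids):
    """Filter candidates, then split them by presence in the reference set."""
    kept = [snp for snp in candidates if passes_filters(snp)]
    confirmed = [snp["id"] for snp in kept if snp["id"] in reference_ids]
    unconfirmed = [snp["id"] for snp in kept if snp["id"] not in reference_ids]
    return confirmed, unconfirmed


# Toy data, invented for illustration only (hypothetical identifiers).
candidates = [
    {"id": "contig1:120", "coverage": 22, "qualities": [80, 75, 72]},
    {"id": "contig1:340", "coverage": 9, "qualities": [85, 90, 88]},   # coverage too low
    {"id": "contig2:57", "coverage": 30, "qualities": [60, 65, 62]},   # quality too low
    {"id": "contig3:410", "coverage": 18, "qualities": [70, 71, 74]},
]
reference_ids = {"contig1:120", "contig1:340"}

confirmed, unconfirmed = validate(candidates, reference_ids)
print("confirmed:", confirmed)
print("considered false positives:", unconfirmed)
```

In this toy case, one candidate is confirmed by the reference set and one passes the filters without a match, so it would be counted as a false positive.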
Table 2.1: SNPs obtained and validated per parameter set. Of the six tested parameter combinations, the lowest false positive rate was obtained with the default values: a minimum coverage of 15 and a minimum average quality of 70.
Parameters used (Min. Coverage|Min. Average Quality)   10|60   10|70   15|60   15|70 (Default)   15|75   20|70
Total SNPs retrieved                                   234     169     155     114               107     89
Confirmed SNPs                                         97      86      88      82                69      57
False positive rate (%)                                58.55   49.11   43.23   28.07             35.51   35.96
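The rates in Table 2.1 follow directly from the two count rows: the false positive rate is the share of retrieved SNPs that were not confirmed. The short sketch below recomputes them from the published counts; the function and variable names are illustrative only.

```python
# Counts copied from Table 2.1:
# (min. coverage, min. average quality) -> (total SNPs retrieved, confirmed SNPs)
TABLE_2_1 = {
    (10, 60): (234, 97),
    (10, 70): (169, 86),
    (15, 60): (155, 88),
    (15, 70): (114, 82),
    (15, 75): (107, 69),
    (20, 70): (89, 57),
}


def false_positive_rate(total, confirmed):
    """Percentage of retrieved SNPs that were not confirmed."""
    return 100.0 * (total - confirmed) / total


rates = {params: round(false_positive_rate(*counts), 2)
         for params, counts in TABLE_2_1.items()}
best = min(rates, key=rates.get)

for (coverage, quality), rate in rates.items():
    print(f"{coverage}|{quality}: {rate:.2f}%")
print(f"lowest rate: {best[0]}|{best[1]}")
```

The recomputed values agree with the last row of the table, and the lowest rate corresponds to the default 15|70 combination.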
Although the number of SNPs retrieved is relatively low, owing to the restrictive assembly and filtering parameters, we consider this a good trade-off given the high confidence of the retrieved SNPs.
SNP calling in 454 data has been performed before on organisms without a reference sequence or strain information, with varying rates of false positives. One such study, which used custom scripts for SNP calling, reported a false positive rate of 80% on 4200 retrieved SNPs (Tollenaere et al., 2012). Another study, in which the contigs of 283 SNPs were manually screened and selected, had a lower false positive rate of 45% (Broders, Woeste, San Miguel, Westerman, & Boland, 2011).
The above-mentioned studies are not directly comparable to the benchmark performed here, since they were performed on different datasets. Nevertheless, they suggest that, in general, 4Pipe4 retrieves a smaller number of SNPs than other methods, but with a considerably lower false positive rate.
Since the main goal of this pipeline is to provide the user with high confidence SNPs for genotyping arrays, a rate of 28.07% false positives is a considerable improvement relative to the other mentioned approaches.
4.4 4Pipe4 compared to other software
Although 4Pipe4 is specifically designed to detect variation when no strain information or reference sequence is available, other software exists that can be used for the same purpose, but which differs from 4Pipe4 in some aspects:
QualitySNP (Tang, Vosman, Voorrips, van der Linden, & Leunissen, 2006) – Relies on CAP3 for clustering the reads (CAP3 is optimized for Sanger sequences, whereas 4Pipe4 uses mira, which is optimized for NGS data). Requires perl, PHP, a configured webserver and a MySQL database for SNP retrieval, which means that root access to the machine on which the program runs is required. Furthermore, QualitySNP has been superseded by the simpler and faster QualitySNPng (Nijveen, Kaauwen, Esselink, Hoegen, & Vosman, 2013).
AGSNP (You et al., 2011) – Relies on the Newbler assembler for clustering and, if strain information is not available, further requires combining 454 data with Illumina or SOLiD data (4Pipe4 does not require data from multiple technologies for SNP calling).
Other programs exist for SNP detection, but they usually require either a reference sequence (such as Atlas-SNP2 or SAMtools) or strain information (such as discoSnp++ (Uricaru et al., 2014), formerly kisSnp (Peterlongo et al., 2010), or DIAL (Ratan, Zhang, Hayes, Schuster, & Miller, 2010)).
There is, however, another program that can be used for the same purpose as 4Pipe4: QualitySNPng. This program is not an analysis pipeline, but rather a SNP caller for read alignments. It has a graphical user interface, which can be disabled for use on servers, but it still requires “Qt4” to be installed on the server (which is not common). In order to compare it with 4Pipe4, we have modified the program to be usable without “Qt4” installed (https://github.com/StuntsPT/QualitySNP) and provide a branch of 4Pipe4 that is ready to use QualitySNPng (https://github.com/StuntsPT/4Pipe4/tree/new_snp_caller) without requiring any further dependencies.
When 4Pipe4 was benchmarked with QualitySNPng as the SNP caller, more SNPs were returned than with our SNP caller (513 with the default values and 147 with tuned values), but with a higher rate of false positives (only 60 and 22 SNPs, respectively, matched those found by Atlas-SNP2, corresponding to false positive rates of 88.3% and 85%). Therefore, the built-in SNP caller was kept as the default, but QualitySNPng can still be used from its own git branch if desired.
For the sake of completeness, we also performed SNP calling on the benchmark dataset using discoSnp++ (which requires strain identification) with the most restrictive parameters, to minimize the number of false positives. This program retrieved 9226 SNPs, of which 5967 were considered true positives (a false positive rate of 35.3%).
As expected, this method retrieves more SNPs than both 4Pipe4 and QualitySNPng since it takes advantage of strain information, but it still provides a somewhat higher false positive rate than 4Pipe4.
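The comparison across SNP callers can be summarized from the counts reported in this section. The sketch below simply ranks the callers by false positive rate; the labels and the `summarise` helper are illustrative names, not part of any of the tools.

```python
# (SNPs retrieved, SNPs matching the Atlas-SNP2 set), as reported in the text.
RESULTS = {
    "4Pipe4 (default)": (114, 82),
    "QualitySNPng (default)": (513, 60),
    "QualitySNPng (tuned)": (147, 22),
    "discoSnp++": (9226, 5967),
}


def summarise(results):
    """Return (caller, retrieved, matched, false positive %) rows, lowest rate first."""
    rows = []
    for caller, (retrieved, matched) in results.items():
        fp_rate = 100.0 * (retrieved - matched) / retrieved
        rows.append((caller, retrieved, matched, round(fp_rate, 1)))
    return sorted(rows, key=lambda row: row[3])


rows = summarise(RESULTS)
for caller, retrieved, matched, fp_rate in rows:
    print(f"{caller:<24}{retrieved:>6}{matched:>6}{fp_rate:>7.1f}%")
```

Sorting by rate makes the trade-off explicit: the callers that retrieve more SNPs on this benchmark also show higher false positive rates.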
5 Conclusions
We present here an automated analysis process specifically designed for SNP detection from 454 pyrosequencing transcriptome reads, which we named 4Pipe4. This is the first program specifically built to automate the whole process of finding putative SNPs in NGS datasets that lack both information regarding the origin of each read and a reference sequence. In silico validation of 4Pipe4 results using previously analysed reference data revealed good performance in the calling of high confidence SNPs.
The 4Pipe4 pipeline, at the cost of retrieving a relatively low number of SNPs, provided a lower rate of false positive SNPs than both an alternative SNP caller (QualitySNPng) and an alternative program that uses strain information (discoSnp++), as well as a lower rate than those reported in previous studies that used different approaches for a similar type of data and goal.
Since the main purpose of this software is to retrieve high confidence SNPs for further exploration, we expect its incremental contributions to improve, speed up and facilitate research in the field of population genomics.
Furthermore, we expect to implement new features in 4Pipe4, such as: graphics in the metrics report; detection of indel variation; integration of alternative software (such as newbler for assembling instead of mira); process optimization for NGS technologies besides 454; and a switch from the FASTA + FASTA.QUAL format to FASTQ. These are some of the planned features, but others can be requested and implemented, should there be demand for them.
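As an illustration of what the planned switch from FASTA + FASTA.QUAL to FASTQ involves, the following minimal sketch merges the two inputs into FASTQ records. It is not code from 4Pipe4: the function names are hypothetical, it assumes both inputs list the same reads in the same order, and it assumes a Phred+33 encoding, capping scores at 93 (the highest value that encoding can represent).

```python
from itertools import zip_longest


def read_fasta_like(lines):
    """Yield (identifier, body_lines) for FASTA or QUAL formatted lines."""
    name, body = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, body
            # Keep only the identifier, dropping any description after it.
            name, body = line[1:].split()[0], []
        else:
            body.append(line)
    if name is not None:
        yield name, body


def fasta_qual_to_fastq(fasta_lines, qual_lines):
    """Merge matching FASTA and QUAL records into FASTQ text (Phred+33)."""
    records = []
    pairs = zip_longest(read_fasta_like(fasta_lines), read_fasta_like(qual_lines))
    for fasta_record, qual_record in pairs:
        if fasta_record is None or qual_record is None:
            raise ValueError("FASTA and QUAL inputs differ in number of records")
        (seq_id, seq_body), (qual_id, qual_body) = fasta_record, qual_record
        sequence = "".join(seq_body)
        scores = [int(value) for value in " ".join(qual_body).split()]
        if seq_id != qual_id or len(sequence) != len(scores):
            raise ValueError(f"FASTA/QUAL mismatch at record {seq_id!r}")
        # Assumption: cap scores at 93 so they fit the printable Phred+33 range.
        encoded = "".join(chr(min(score, 93) + 33) for score in scores)
        records.append(f"@{seq_id}\n{sequence}\n+\n{encoded}\n")
    return "".join(records)


# Toy input, invented for illustration only.
fasta = [">read1 length=4", "ACGT", ">read2", "GG", "A"]
qual = [">read1 length=4", "40 30 20 10", ">read2", "0 5", "93"]
fastq = fasta_qual_to_fastq(fasta, qual)
print(fastq, end="")
```

A single FASTQ file keeps each sequence and its qualities together, which removes the need to keep two files synchronized.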
6 Authors' contributions
FPM developed the software, drafted the SNP detection routines and wrote the manuscript. BMV extensively reviewed the code and assisted in the initial data analyses. SGS provided valuable insights on SNP data interpretation and proofread the manuscript. DB provided the samples and the datasets for the initial data analyses and proofread the manuscript. OSP conceived the study and participated in its design and coordination.
7 Acknowledgements
This work was fully supported by projects SOBREIRO/0036/2009 (under the framework of the Cork Oak ESTs Consortium), PTDC/BIA-BEC/098783/2008, and PTDC/AGR-GPL/119943/2010 from Fundação para a Ciência e Tecnologia (FCT) – Portugal. F. Pina-Martins was funded by FCT grant SFRH/BD/51411/2011, under the PhD program “Biology and Ecology of Global Changes”, Univ. Aveiro & Univ. Lisbon, Portugal. D. Batista was funded by FCT grant SFRH/BPD/104629/2014. We would also like to thank the reviewers of the manuscript, who provided very constructive criticism, resulting in great improvements to both the manuscript and the software.
8 References
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403–410.
Broders, K. D., Woeste, K. E., San Miguel, P. J., Westerman, R. P., & Boland, G. J. (2011). Discovery of single-nucleotide polymorphisms (SNPs) in the uncharacterized genome of the ascomycete Ophiognomonia clavigignenti-juglandacearum from 454 sequence data. Molecular Ecology Resources, 11(4), 693–702. doi:10.1111/j.1755-0998.2011.02998.x
Chevreux, B., Pfisterer, T., Drescher, B., Driesel, A. J., Muller, W. E. G., Wetter, T., & Suhai, S. (2004). Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Research, 14(6), 1147–1159.
Conesa, A., Gotz, S., Garcia-Gomez, J. M., Terol, J., Talon, M., & Robles, M. (2005). Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics (Oxford, England), 21(18), 3674–3676.
Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357–9.
Leinonen, R., Sugawara, H., & Shumway, M. (2011). The sequence read archive. Nucleic Acids Research, 39(Database issue), D19-21.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., … Durbin, R. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England), 25(16), 2078–9.
Modesto, I. S., Miguel, C., Pina-Martins, F., Glushkova, M., Veloso, M., Paulo, O. S., & Batista, D. (2014). Identifying signatures of natural selection in cork oak (Quercus suber L.) genes through SNP analysis. Tree Genetics & Genomes, 10(6), 1645–1660. doi:10.1007/s11295-014-0786-1
Nijveen, H., Kaauwen, M. van, Esselink, D. G., Hoegen, B., & Vosman, B. (2013). QualitySNPng: a user-friendly SNP detection and visualization tool. Nucleic Acids Research, 41(W1), W587–W590. doi:10.1093/nar/gkt333
Papanicolaou, A., Stierli, R., Ffrench-Constant, R. H., & Heckel, D. G. (2009). Next generation transcriptomes for next generation genomes using est2assembly. BMC Bioinformatics, 10, 447.
Peterlongo, P., Schnel, N., Pisanti, N., Sagot, M. F., & Lacroix, V. (2010). Identifying SNPs without a Reference Genome by Comparing Raw Reads, 6393, 147–158.
pysam-developers/pysam. (n.d.). Retrieved March 13, 2015, from https://github.com/pysam-developers/pysam
Ratan, A., Zhang, Y., Hayes, V. M., Schuster, S. C., & Miller, W. (2010). Calling SNPs without a reference sequence. BMC Bioinformatics, 11, 130.
Rice, P., Longden, I., & Bleasby, A. (2000). EMBOSS: The European molecular biology open software suite. Trends in Genetics: TIG, 16(6), 276–277.
Savage, A. E., Kiemnec-Tyburczy, K. M., Ellison, A. R., Fleischer, R. C., & Zamudio, K. R. (2014). Conservation and divergence in the frog immunome: pyrosequencing and de novo assembly of immune tissue transcriptomes. Gene, 542(2), 98–108. doi:10.1016/j.gene.2014.03.051
Schuster, S. C. (2008). Next-generation sequencing transforms today’s biology. Nature Methods, 5(1), 16–18.
Sequence Cleaner. (n.d.). Retrieved March 13, 2015, from http://sourceforge.net/projects/seqclean/
Shen, Y., Wan, Z., Coarfa, C., Drabek, R., Chen, L., Ostrowski, E. A., … Yu, F. (2010). A SNP discovery method to assess variant allele probability from next-generation resequencing data. Genome Research, 20(2), 273–80.
Tang, J., Vosman, B., Voorrips, R. E., van der Linden, C. G., & Leunissen, J. A. M. (2006). QualitySNP: a pipeline for detecting single nucleotide polymorphisms and insertions/deletions in EST data from diploid and polyploid species. BMC Bioinformatics, 7, 438.
The UniVec Database. (n.d.). Retrieved March 13, 2015, from http://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/
Tollenaere, C., Susi, H., Nokso-Koivisto, J., Koskinen, P., Tack, A., Auvinen, P., … Laine, A.-L. (2012). SNP Design from 454 Sequencing of Podosphaera plantaginis Transcriptome Reveals a Genetically Diverse Pathogen Metapopulation with High Levels of Mixed-Genotype Infection. PLoS ONE, 7(12), e52492. doi:10.1371/journal.pone.0052492
Uricaru, R., Rizk, G., Lacroix, V., Quillery, E., Plantard, O., Chikhi, R., … Peterlongo, P. (2014). Reference-free detection of isolated SNPs. Nucleic Acids Research, 43(2), e11–e11. doi:10.1093/nar/gku1187
You, F. M., Huo, N., Deal, K. R., Gu, Y. Q., Luo, M.-C., McGuire, P. E., … Anderson, O. D. (2011). Annotation-based genome-wide SNP discovery in the large and complex Aegilops tauschii genome using next-generation sequencing without a reference genome sequence. BMC Genomics, 12, 59.