
1.6 Finding Conserved Regions

1.6.5 Barcode Designing Tools

Several barcode design tools build on the algorithms described above for finding conserved regions while taking primer properties into account. However, "primer design" is a more appropriate term than "barcode design", because these tools concentrate on primer design and the thermodynamic properties of primer pairs, and give no measure of the discrimination capacity of the region amplified by a primer pair. Most existing tools are easy to use in the context of sensu stricto barcoding, when we want to adapt standard barcode primers to a new clade, but they are less well suited to metabarcoding or environmental barcoding. In this section, we take a detailed look at the most important tools designed for this purpose.

The first program in line is perhaps PRIMER (Primer 0.5), developed by the Whitehead Institute/MIT Center for Genome Research, but this program was never published. A completely rewritten version of PRIMER (Primer 0.5) exists in the form of Primer3 (Rozen and Skaletsky, 2000). It takes a single sequence as input and selects single primers or PCR primer pairs, considering oligonucleotide melting temperature, length, GC content, primer-dimer possibilities, PCR product size and positional constraints within the source sequence. Primer3 also provides some objective functions to be computed for each primer pair. They include checking each primer pair against a mispriming library (which means that primer pairs should not amplify any of the non-target sequences specified in the mispriming library) and checking the primers for self-complementarity. However, computing these objective functions increases the running time of the program. The most time-consuming operation is checking each primer pair against a mispriming library. In this case Primer3 adopts a very rigorous approach: it locally aligns each candidate primer against each library sequence and rejects those primers for which the local alignment score exceeds a specified weight. The running time of Primer3 also depends on the size of the input sequence: it is a linear function of sequence size.

Primer3 is perhaps an ideal solution, as it provides many adjustable parameters and is the most widely used program, but since it seeks to amplify a single target sequence, it cannot be used to design universal primers amplifying a large number of target sequences. Moreover, the overhead of the alignments needed to exclude amplification of non-target sequences makes this program infeasible for large applications.
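The kind of per-primer property checks described above can be illustrated with a small sketch. It is not Primer3's implementation: the melting temperature uses the simple Wallace rule (2 °C per A/T, 4 °C per G/C) instead of a thermodynamic model, and the function names and threshold values are assumptions chosen only for illustration.

```python
def gc_content(primer: str) -> float:
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> int:
    """Rough melting temperature in degrees C (Wallace rule)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def passes_filters(primer: str, min_len=18, max_len=28,
                   min_gc=0.4, max_gc=0.6, min_tm=50, max_tm=65) -> bool:
    """Check length, GC content and Tm against illustrative (assumed) limits."""
    return (min_len <= len(primer) <= max_len
            and min_gc <= gc_content(primer) <= max_gc
            and min_tm <= wallace_tm(primer) <= max_tm)


def candidate_primers(sequence: str, length: int = 20) -> list[tuple[int, str]]:
    """Slide a window over a single sequence and keep windows passing the filters."""
    hits = []
    for start in range(len(sequence) - length + 1):
        window = sequence[start:start + length]
        if passes_filters(window):
            hits.append((start, window))
    return hits
```

Because the window slides once over the input, the cost of this scan grows linearly with sequence length, which matches the linear running time noted above for the basic selection step.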

UniPrime (Bekaert and Teeling, 2008), QPRIMER (Kim and Lee, 2007), and Primaclade (Gadberry et al., 2005) are three programs that find primer pairs from an alignment of multiple sequences.

UniPrime takes the GenBank GeneID of the target locus as input and selects the prototype sequence (the mRNA sequence of the longest isoform of the gene). This prototype sequence is then used as the query in a Blastn search for all highly similar homologous sequences. The stored sequences are concatenated into a single file and then aligned using the T-Coffee program (Notredame et al., 2000). From this alignment a consensus sequence is inferred, and all possible primers along the consensus sequence are generated by Primer3.

QPRIMER is a web-based application that designs conserved PCR and RT-PCR primers from multiple genome alignments, making use of a genome browser (Pygr) and Primer3. Pygr⁶ (Python Graph Database Framework for Bioinformatics) is an open source program for sequence and comparative genomics analyses. It can query large sequence databases or multiple genome data sets to find regions of interest.

QPRIMER supports human, mouse, rat, chicken, dog, zebrafish and fruit fly sequences for primer design. It lets users browse a specific gene of interest in a genome browser based on genomic location, and select any region in the gene structure as a target for amplification. For the selected gene region, QPRIMER uses Pygr to extract the sequence from the multiple alignment data set. It then uses Primer3 to design primer pairs from the extracted data set. QPRIMER selects primers from exonic regions only. The major disadvantage of this primer design approach is that primers are selected from a single sequence only and non-target sequences cannot be supplied. Moreover, this type of application is only useful for selecting primer pairs from standard genes that are known to be conserved.

6 (http://bioinfo.mbi.ucla.edu/pygr/)

Primaclade is also a web-based application and is based on multiple genome alignment to infer conserved regions. It accepts as input a multiple species nucleotide alignment file saved in Clustal (Thompson et al., 1997), EMBOSS (Rice et al., 2000) or any other alignment format, and identifies a set of degenerate PCR primers that will bind across the alignment. To select the primer pairs, Primaclade computes a consensus sequence from the alignment file. It then splits the alignment file into individual sequences and uses Primer3 to compute an exhaustive set of primer pairs for each individual sequence from the alignment file. To compute a large number of primers, Primaclade runs Primer3 eleven times for each sequence, starting from a primer length of 18 bp and increasing the length by 1 bp each time, terminating at a primer length of 28 bp.

After generating primer pairs for each sequence in the alignment file, it compares them to the corresponding nucleotides in the consensus sequence to check whether the consensus contains the allowed number of degenerate nucleotides or fewer. If so, the primer is saved; otherwise it is discarded.
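The consensus comparison can be sketched as follows. This is only an illustration under stated assumptions, not Primaclade's code: the consensus is built with standard IUPAC ambiguity codes, any column containing a gap or unknown character is treated as fully degenerate (N), the Primer3 step is skipped, and the names `consensus`, `keep_primer` and the `max_degenerate` parameter are hypothetical.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases observed.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus(alignment: list[str]) -> str:
    """Degenerate consensus of equal-length, upper-case aligned sequences."""
    codes = []
    for column in zip(*alignment):
        # Columns with gaps or other symbols are not in the table and become N.
        codes.append(IUPAC.get(frozenset(column), "N"))
    return "".join(codes)


def count_degenerate(segment: str) -> int:
    """Number of positions that are not a plain A, C, G or T."""
    return sum(1 for code in segment if code not in "ACGT")


def keep_primer(consensus_seq: str, start: int, length: int,
                max_degenerate: int) -> bool:
    """Keep a primer if its consensus window has at most max_degenerate codes."""
    window = consensus_seq[start:start + length]
    return count_degenerate(window) <= max_degenerate


# The eleven primer lengths tried per sequence, as described in the text.
PRIMER_LENGTHS = range(18, 29)
```

For example, the three aligned sequences `ACGTAC`, `ACGTAT` and `ACATAC` give the consensus `ACRTAY`; a primer covering the first four columns contains one degenerate position, while one covering all six contains two.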

Although this alignment-based approach to primer design is effective, it yields a limited number of barcode markers, because the only conserved regions identified are those specific to a particular gene.

Some programs have been designed in the context of environmental applications, such as Greene SCPrimer (Jabado et al., 2006) and PrimerHunter (Duitama et al., 2009). Both of these programs have been designed for PCR detection of viruses, which is a sensu lato barcoding application. Greene SCPrimer is also based on the processing of a multiple sequence alignment. It determines the optimum primer pairs from a nucleic acid sequence alignment by first constructing a phylogenetic tree to identify candidate primers and then using a greedy algorithm to identify a minimum set of primers that amplifies all members of the alignment. The exact algorithm is as follows. From a multiple alignment of sequences, sub-alignments of a length appropriate for PCR primers are extracted, and only unique strings are kept for further processing. For the short sequences that are kept, a similarity matrix is generated using pairwise alignment. This similarity matrix is used to generate a phylogenetic tree with a hierarchical clustering algorithm based on Euclidean distance, using an open source clustering library (de Hoon et al., 2004). At each node of the phylogenetic tree, a consensus sequence is computed, and the primers are then checked and filtered for physical constraints such as Tm, GC content and degeneracy. After the primers are scored, a greedy algorithm keeps only a minimum number of primers that amplify all the sequences in the alignment. The last step is to identify primer pairs with matching Tm that are suitable for amplifying products of a specific size range. The time complexity of the tree-building step is O(n³) and the primer scoring step is linear in time, whereas the third step, primer minimization, has complexity O(n log n). The last step, building primer pairs, is linear in time.
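The primer minimization step can be read as a greedy set cover. The sketch below illustrates that general technique and is not the Greene SCPrimer code: the table mapping each candidate primer to the sequences it amplifies is assumed to be precomputed, and the function name and the example data are hypothetical.

```python
def greedy_primer_cover(coverage: dict[str, set[str]]) -> list[str]:
    """Pick primers one at a time, always taking the primer that amplifies
    the most sequences not yet covered, until every sequence is covered."""
    # Every sequence amplified by at least one candidate primer must be covered.
    uncovered = set().union(*coverage.values())
    chosen = []
    while uncovered:
        # Ties are broken by the insertion order of the dictionary.
        best = max(coverage, key=lambda primer: len(coverage[primer] & uncovered))
        chosen.append(best)
        uncovered -= coverage[best]
    return chosen


# Hypothetical example: four candidate primers and four sequences.
example = {
    "p1": {"s1", "s2", "s3"},
    "p2": {"s3", "s4"},
    "p3": {"s4"},
    "p4": {"s1"},
}
selected = greedy_primer_cover(example)
```

Here `p1` is chosen first because it covers three sequences, and `p2` then covers the remaining `s4`. A greedy choice of this kind gives a small set of primers but is not guaranteed to give the smallest possible one.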

The other program for designing PCR primers for viruses is PrimerHunter (Duitama et al., 2009). This program has been designed to select highly sensitive and specific primers for virus subtypes. The tool takes two FASTA files as input, one containing the target sequences and the other containing non-target sequences. Primers are selected such that they efficiently amplify any one of the target sequences and none of the non-target sequences. The program uses hash tables to build primers, making sure that the 3' end does not allow mismatches. PrimerHunter uses the nearest-neighbor thermodynamic model of SantaLucia and Hicks (2004) to calculate accurate melting temperatures.
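The target versus non-target selection rule can be sketched as below. This is a simplified illustration and not PrimerHunter's algorithm: a naive scan replaces the hash tables, a mismatch count replaces the thermodynamic model, only the forward strand is searched, and the names and the parameters `max_mismatches` and `exact_3prime` are assumptions.

```python
def binds(primer: str, sequence: str,
          max_mismatches: int = 2, exact_3prime: int = 3) -> bool:
    """True if the primer (written 5'->3') matches some site in the sequence
    with few mismatches and a perfectly matching 3' end.
    Assumes 1 <= exact_3prime <= len(primer)."""
    n = len(primer)
    for start in range(len(sequence) - n + 1):
        site = sequence[start:start + n]
        # The last exact_3prime bases (the 3' end) must match exactly.
        if site[-exact_3prime:] != primer[-exact_3prime:]:
            continue
        mismatches = sum(a != b for a, b in zip(primer, site))
        if mismatches <= max_mismatches:
            return True
    return False


def is_specific(primer: str, targets: list[str], non_targets: list[str]) -> bool:
    """Keep a primer that binds at least one target and no non-target."""
    hits_target = any(binds(primer, seq) for seq in targets)
    hits_non_target = any(binds(primer, seq) for seq in non_targets)
    return hits_target and not hits_non_target
```

With the primer `ACGTACGT`, the hypothetical target `TTACGAACGTTT` is accepted (one internal mismatch, intact 3' end), while the hypothetical non-target `TTACGTACGATT` is rejected because its only near-match differs in the last base.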

Although both Greene SCPrimer and PrimerHunter are sensu lato barcoding applications, the efficiency of both programs is questionable. Greene SCPrimer processes an alignment and constructs a phylogenetic tree, both of which are computationally expensive, so it cannot be used efficiently on large sequence sets. PrimerHunter is based on a thermodynamic model and also needs a lot of computation to decide whether a primer pair should be selected, so it works well for small sequence sets but is not efficient enough for larger sequence databases. A comparison of the main features of some important existing primer and probe selection tools is given in Duitama et al. (2009).

In the document: Tiayyba Riaz (pages 44-47)