
Complete Formalization Of $B_c$ And $B_s$

In the document Tiayyba Riaz (pages 80-84)

The article above does not give the complete mathematical formalization of the $B_c$ and $B_s$ indices; it is detailed in this section. For this purpose we need to define some sets and relations. As shown in Figure 1 of the publication above, we define the following sets:

$T = \{t_i\}$: the set of all taxa.

$I = \{id_i\}$: the set of all individuals.

$B = \{b_i\}$: the set of all barcode sequences.

$R = \{r_i\}$: the set of all barcode regions.

$L = \{l_i\}$: the set of all taxonomic levels (ranks).

We define the following relations on these sets:

$E : T \mapsto I$: membership relation of an individual to a taxon.

$E_L : L \mapsto T$: membership relation of a taxon to a taxonomic level.

$E_R : R \mapsto B$: membership relation of a barcode to a region.

$\mathrm{Img} : I \mapsto B$: gives the barcodes identifying an individual.

The set of all taxa amplified by the region $r$, that is, detectable by the primer pair defining this region, is given by:

$$\beta(r) \equiv E^{-1}(\mathrm{Img}^{-1}(E_R(r))), \qquad (2.3.1)$$

where $E_R(r)$ is the set of barcodes belonging to region $r$. Since $E_L(l)$ gives the set of taxa belonging to a taxonomic level $l$, we denote the taxa of this taxonomic level amplified by the region $r$ as:

$$\alpha(r,l) \equiv \beta(r) \cap E_L(l). \qquad (2.3.2)$$

2.3.1 Complete Formalization Of $B_c$

The coverage index, as defined in the above article, is the ratio of the number of amplified taxa to the total number of taxa of the same taxonomic level in the input data set. This index can only be computed if the taxonomic content of the data set is fully defined.

From the above relations we define $B_c : R \times L \mapsto \mathbb{R}$, the fraction of taxa of a taxonomic level $l$ detectable by the primer pair defining the region $r$:

$$B_c(r,l) = \frac{|\alpha(r,l)|}{|E_L(l)|}. \qquad (2.3.3)$$

Following definition 2.3.3, identifying the best region in terms of coverage corresponds to Problem 1.

Problem 1.

Find $r$ such that $B_c(r,l)$ is maximal.
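The coverage definition can be illustrated with plain set operations. The following is a minimal sketch, not the authors' implementation: the toy data, the dictionary layout and the function names (`amplified_taxa`, `coverage`, `best_region`) are all hypothetical and chosen only for illustration.

```python
# Hypothetical toy data; none of these values come from the article.
taxon_individuals = {            # taxon -> its individuals (relation E)
    "t1": {"i1", "i2"},
    "t2": {"i3"},
    "t3": {"i4"},
}
individual_barcodes = {          # individual -> its barcodes (relation Img)
    "i1": {"b1"},
    "i2": {"b2"},
    "i3": {"b3"},
    "i4": set(),                 # an individual with no barcode in any region
}
region_barcodes = {              # region -> barcodes belonging to it
    "r1": {"b1", "b2", "b3"},
    "r2": {"b1"},
}
level_taxa = {"species": {"t1", "t2", "t3"}}   # level -> taxa (relation E_L)


def amplified_taxa(region):
    """beta(r): taxa with at least one individual carrying a barcode of the region."""
    barcodes = region_barcodes[region]
    return {
        taxon
        for taxon, individuals in taxon_individuals.items()
        if any(individual_barcodes[ind] & barcodes for ind in individuals)
    }


def coverage(region, level):
    """B_c(r, l) = |alpha(r, l)| / |E_L(l)| with alpha = beta(r) & E_L(l)."""
    alpha = amplified_taxa(region) & level_taxa[level]
    return len(alpha) / len(level_taxa[level])


def best_region(level):
    """Problem 1: a region maximizing B_c for the given level."""
    return max(region_barcodes, key=lambda region: coverage(region, level))


if __name__ == "__main__":
    for region in sorted(region_barcodes):
        print(region, coverage(region, "species"))
    print("best:", best_region("species"))
```

With this toy data, region `r1` amplifies two of the three species (coverage 2/3) and `r2` only one (coverage 1/3), so `r1` is selected.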

2.3.2 Complete Formalization Of $B_s$

In the above publication, barcode specificity ($B_s$) is defined as the ability of a region to discriminate between two taxa, or the ability of a region to unambiguously identify a taxon. We further said that a taxon is unambiguously identified if it owns a barcode region that is not shared by any other taxon of the same taxonomic rank. In order to compute the number of unambiguously identified taxa, we need to define some more relations.

Using the above sets and relations, we define:

$$\Omega(t,r) = \mathrm{Img}(E(t)) \cap E_R(r), \qquad (2.3.4)$$

where $E_R(r)$ is the set of barcodes of region $r$, so that $\Omega(t,r)$ gives the set of all barcodes of region $r$ identifying individuals of a taxon $t$. Conversely, the set of all individuals (which may belong to multiple taxa) identified by a barcode of region $r$ is given as:

$$\mathrm{Img}^{-1}(\Omega) = \bigcup_i \mathrm{Img}^{-1}(b_i \mid b_i \in \Omega). \qquad (2.3.5)$$

We say that a taxon $t$ is unambiguously identified (or well identified) by a barcode region $r$ if and only if

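$$\mathrm{Img}^{-1}(\Omega(t,r)) = E(t). \qquad (2.3.6)$$

If we denote the set of well-identified taxa by $\epsilon$:

$$\epsilon \equiv \{t \mid \text{equation 2.3.6 holds}\}, \qquad (2.3.7)$$

then the specificity $B_s$ of a region $r$ for a taxonomic level $l$ is given as:

$$B_s(r,l) \equiv \frac{|\{t \mid t \in \epsilon\}|}{|\alpha(r,l)|}. \qquad (2.3.8)$$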

Following this definition, identifying the best region in terms of specificity corresponds to Problem 2.

Problem 2.

Find $r$ such that $B_s(r,l)$ is maximal.
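The strict specificity test can be sketched in the same set-based style. This is only an illustrative sketch under stated assumptions: the toy data is hypothetical (three taxa, two of which share barcode `b4`, an arrangement invented here and not copied from the article's figure), and the helper names `omega`, `individuals_with` and `specificity` are not from the source.

```python
# Hypothetical toy data: t2 and t3 share barcode b4 through individuals i4 and i6.
taxon_individuals = {            # taxon -> its individuals
    "t1": {"i1", "i2"},
    "t2": {"i3", "i4"},
    "t3": {"i5", "i6"},
}
individual_barcodes = {          # individual -> its barcodes
    "i1": {"b1"}, "i2": {"b2"},
    "i3": {"b3"}, "i4": {"b4"},
    "i5": {"b5"}, "i6": {"b4"},
}
region_barcodes = {"r1": {"b1", "b2", "b3", "b4", "b5"}}   # region -> barcodes
level_taxa = {"species": {"t1", "t2", "t3"}}               # level -> taxa


def omega(taxon, region):
    """Barcodes of the region carried by individuals of the taxon (eq. 2.3.4)."""
    carried = set()
    for ind in taxon_individuals[taxon]:
        carried |= individual_barcodes[ind]
    return carried & region_barcodes[region]


def individuals_with(barcodes):
    """All individuals owning at least one of the given barcodes (eq. 2.3.5)."""
    return {ind for ind, owned in individual_barcodes.items() if owned & barcodes}


def specificity(region, level):
    """B_s(r, l): share of amplified taxa satisfying the strict test of eq. 2.3.6."""
    # Taxa of the level with a non-empty barcode set in the region, i.e. alpha(r, l).
    alpha = {t for t in level_taxa[level] if omega(t, region)}
    if not alpha:
        return 0.0   # assumption: report 0 when no taxon is amplified
    well_identified = {
        t for t in alpha
        if individuals_with(omega(t, region)) == taxon_individuals[t]
    }
    return len(well_identified) / len(alpha)


if __name__ == "__main__":
    print(specificity("r1", "species"))
```

In this toy data only `t1` passes the strict equality, so the specificity is 1/3.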

2.3.3 Extending The Definition Of $B_s$

The strict equality between the left and right sides of equation 2.3.6 gives rise to a potential problem of falsely decreasing the value of $B_s$. Looking at Figure 1 of the article, we can see that taxa $T_2$ and $T_3$ are ambiguous because barcode sequence $b_4$ is shared between individuals of these taxa. This reduces the specificity value to 1/3 because only 1 taxon out of 3 is well identified. There are two potential reasons for individual $I_6$ to own $b_4$. First, this individual may genuinely share its barcode sequence with another taxon, so that neither of the taxa sharing the barcode is well identified. Second, since public databases like GenBank contain many errors in taxonomic annotation, it is quite possible that individual $I_6$ actually belongs to the other taxon, $T_2$; its misassignment to $T_3$ then makes $T_2$ ambiguous. Under this second hypothesis the value of barcode specificity is falsely decreased.

To tackle this problem, we can extend the definition of barcode specificity to allow some errors in annotation. We say that a taxon $t$ is identified by a barcode region $r$, allowing a false positive error rate $Q$, if and only if

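$$E_Q(t,r) \equiv \Big( E(t) \subseteq \mathrm{Img}^{-1}(\Omega(t,r)) \ \text{and}\ |\mathrm{Img}^{-1}(\Omega(t,r)) \cap \bar{E}(t)| \le Q\,|\mathrm{Img}^{-1}(\Omega(t,r))| \Big). \qquad (2.3.9)$$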

This defines a mapping $E_Q$ from $T$ to $R$. This mapping has two conditions: i) $E(t) \subseteq \mathrm{Img}^{-1}(\Omega(t,r))$, which means that the barcodes of region $r$ identifying the individuals of taxon $t$ may also identify individuals of some other taxa; ii) $|\mathrm{Img}^{-1}(\Omega(t,r)) \cap \bar{E}(t)| \le Q\,|\mathrm{Img}^{-1}(\Omega(t,r))|$, which means that the individuals identified by barcodes of region $r$ but not belonging to taxon $t$ make up no more than a fraction $Q$ of all the individuals identified. If these two conditions hold, then the extended definition of $B_s$ is given as:

$$B_s(r,l,Q) \equiv \frac{|\{t \mid t \in E_Q(t,r)\}|}{|\alpha(r,l)|}. \qquad (2.3.10)$$

We can observe that equation 2.3.10 is equivalent to equation 2.3.8 if $Q = 0$. The result of this relaxed definition is an increase in the value of $B_s$.
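The effect of the tolerance $Q$ can be checked on a small example. The sketch below is illustrative only: the toy data is hypothetical (two taxa sharing barcode `b4`), and the names `identified_with_tolerance` and `relaxed_specificity` are invented for this example rather than taken from the source.

```python
# Hypothetical toy data: t2 and t3 share barcode b4 through individuals i4 and i6.
taxon_individuals = {            # taxon -> its individuals
    "t1": {"i1", "i2"},
    "t2": {"i3", "i4"},
    "t3": {"i5", "i6"},
}
individual_barcodes = {          # individual -> its barcodes
    "i1": {"b1"}, "i2": {"b2"},
    "i3": {"b3"}, "i4": {"b4"},
    "i5": {"b5"}, "i6": {"b4"},
}
region_barcodes = {"r1": {"b1", "b2", "b3", "b4", "b5"}}   # region -> barcodes
level_taxa = {"species": {"t1", "t2", "t3"}}               # level -> taxa


def omega(taxon, region):
    """Barcodes of the region carried by individuals of the taxon."""
    carried = set()
    for ind in taxon_individuals[taxon]:
        carried |= individual_barcodes[ind]
    return carried & region_barcodes[region]


def identified_with_tolerance(taxon, region, q):
    """The two conditions of the relaxed test, for a false positive rate q."""
    members = taxon_individuals[taxon]
    barcodes = omega(taxon, region)
    hits = {ind for ind, owned in individual_barcodes.items() if owned & barcodes}
    outsiders = hits - members        # identified individuals outside the taxon
    return members <= hits and len(outsiders) <= q * len(hits)


def relaxed_specificity(region, level, q):
    """B_s(r, l, Q): share of amplified taxa passing the relaxed test."""
    alpha = {t for t in level_taxa[level] if omega(t, region)}
    if not alpha:
        return 0.0   # assumption: report 0 when no taxon is amplified
    passed = {t for t in alpha if identified_with_tolerance(t, region, q)}
    return len(passed) / len(alpha)


if __name__ == "__main__":
    for q in (0.0, 0.2, 0.34):
        print(q, relaxed_specificity("r1", "species", q))
```

With `q = 0.0` the result matches the strict definition (1/3 here). Each ambiguous taxon has one outsider among three identified individuals, so both are accepted once `q` exceeds 1/3 and the value rises to 1.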

The main problem with this new version of $B_s$ is that for a precise taxonomic rank such as species, the number of sequences belonging to each taxon is low on average. For example, consider two species, $sp_1$ and $sp_2$, represented by two sequences each, $s_a$, $s_b$ and $s_c$, $s_d$ respectively, where $s_d$ is erroneously annotated as belonging to $sp_1$. In this case we need to set $Q > 1/3$ to tackle this error. But such a high value of $Q$ is unrealistic and could lead to an artificially increased value of $B_s$. A solution could be to consider the decisions (this taxon is unambiguously identified) not individually but as a set of decisions noised by a binomial process of wrong annotation with parameters $p$ and $n$, where $p$ is the error rate in GenBank (approximately 10%). Under this hypothesis we would select the set of decisions maximizing the likelihood and then compute $B_s$ according to it.

2.3.4 Falsely Increased Value Of $B_s$

A taxon owns a set of barcode sequences that belong to a barcode region. According to our definition, an unambiguously identified taxon shares none of its barcode sequences with another taxon. Two taxa are considered to share a barcode sequence if at least one barcode sequence of the first taxon is strictly identical to a sequence in the set of barcode sequences of the second taxon. If we consider the possibility of errors during sequencing or PCR amplification, then it is possible to have certain taxa sharing some barcode sequences. Given two taxa $t_1$ and $t_2$ with one barcode sequence each, $s_1$ and $s_2$ respectively, if $s_1$ and $s_2$ differ by only one base pair we will not be able to distinguish them during the analysis of the results. If $s_1$ and $s_2$ are both present in the results, there are three possibilities: i) both $t_1$ and $t_2$ are actually present in the sample; ii) only $t_1$ is present and $s_2$ is a reading error of $s_1$; iii) the opposite situation. We can deal with this problem by changing the initial definition as follows: two taxa $t_1$ and $t_2$, with associated sets of barcode sequences $s_1$ and $s_2$ respectively, are considered as unambiguously identified if and only if

$$\forall s_i \in s_1 \text{ and } \forall s_j \in s_2 : \min(d_H(s_i,s_j)) > d_{min}. \qquad (2.3.11)$$

If $d_{min} = 0$ this new definition is identical to the original one. By increasing $d_{min}$ we obtain a measure of $B_s$ that is more robust but has a smaller value.

To compute $B_s$ following this new definition, we build a graph $G(S,D)$ where $S$, the set of vertices, is composed of all possible barcode sequences $s$ for the considered marker and $D$ is a relation defined as $d_H(s_i,s_j) \le d_{min}$. Each $c \in C$, the set of all connected components composing $G$, can be considered as an equivalence class of barcode sequences. Thus $B_s$ can be computed by substituting the set $B$ with $C$ in the original definition.
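The grouping of barcode sequences into connected components can be sketched with a union-find structure. This is a minimal illustration under several assumptions that are not stated in the source: $d_H$ is taken to be the Hamming distance, all sequences have the same length, only an observed list of sequences is used as the vertex set, and the function names are hypothetical. The pairwise comparison is quadratic in the number of sequences, which is acceptable only for small inputs.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def barcode_classes(sequences, d_min):
    """Connected components of the graph linking sequences at distance <= d_min."""
    sequences = list(dict.fromkeys(sequences))   # drop duplicates, keep order
    parent = {s: s for s in sequences}

    def find(s):
        # Follow parent links to the representative, compressing the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(sequences, 2):
        if hamming(a, b) <= d_min:
            parent[find(a)] = find(b)            # merge the two components

    classes = {}
    for s in sequences:
        classes.setdefault(find(s), set()).add(s)
    return list(classes.values())


if __name__ == "__main__":
    observed = ["ACGT", "ACGA", "ACCA", "TTTT"]  # hypothetical barcodes
    for component in barcode_classes(observed, d_min=1):
        print(sorted(component))
```

Each returned component plays the role of one equivalence class $c \in C$. Note that `ACGT` and `ACCA` differ at two positions but still fall in the same class because both are within distance 1 of `ACGA`; this chaining is inherent to the connected-component definition.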
