Test set validation - Validation of the classification model

3. Results

3.5. Validation of the classification model

3.5.1. Test set validation

Prediction on the test set was performed, and the blocking procedure was tested with six different thresholds: 0.5, 0.6, 0.7, 0.8, 0.9, and 0.95. The number of levels that can be predicted when each threshold is applied is shown in Figure 3.4.

Figure 3.4. Distribution of the number of categories predicted when the blocking strategy is applied with each threshold.

As the threshold increases, the number of compounds predicted only at the Kingdom level rises sharply. This shows that the next level (Superclass) offers more resistance to classification: more compounds are blocked there than at the subsequent levels (Class and Subclass). A possible explanation is that a single classifier is responsible for most Superclass-level compounds and has to deal with very heterogeneous and imbalanced categories, which makes its predictions less accurate.
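The blocking procedure can be illustrated with a minimal sketch. It assumes that the confidence checked against the threshold is the product of the local probabilities along the path, and that a parent with a single child is assigned a probability of 1; the function name, the classifier interface, and the toy categories are all hypothetical and are not taken from the model's implementation.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a local classifier maps a feature vector to
# (predicted child category, probability of that prediction).
LocalClassifier = Callable[[List[float]], Tuple[str, float]]

N_LEVELS = 4  # Kingdom, Superclass, Class, Subclass


def predict_with_blocking(
    features: List[float],
    classifiers: Dict[str, LocalClassifier],
    only_child: Dict[str, str],
    threshold: float,
    root: str = "root",
) -> List[Tuple[str, float]]:
    """Walk down the hierarchy and stop once confidence falls below threshold.

    Assumption: confidence is the product of the local probabilities so far.
    """
    path: List[Tuple[str, float]] = []
    parent = root
    cumulative = 1.0
    for _ in range(N_LEVELS):
        if parent in only_child:
            # A parent with one child needs no classifier: probability 1.
            child, prob = only_child[parent], 1.0
        elif parent in classifiers:
            child, prob = classifiers[parent](features)
        else:
            break  # nothing known below this category
        cumulative *= prob
        if cumulative < threshold:
            break  # blocked: keep only the levels predicted so far
        path.append((child, cumulative))
        parent = child
    return path


# Toy hierarchy with placeholder category names (not real taxonomy entries).
toy_classifiers = {
    "root": lambda x: ("Kingdom K", 0.99),
    "Kingdom K": lambda x: ("Superclass S", 0.80),
    "Superclass S": lambda x: ("Class C", 0.90),
}
toy_only_child = {"Class C": "Subclass C1"}

for t in (0.5, 0.75, 0.95):
    levels = predict_with_blocking([0.0], toy_classifiers, toy_only_child, t)
    print(t, [name for name, _ in levels])
```

In this toy case the compound keeps all four levels at a threshold of 0.5, two levels at 0.75, and only the Kingdom level at 0.95, because the weakest local prediction is the Superclass one.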

These results also show that most compounds can still reach the Subclass level with a very high prediction probability. This raised the question of whether this behaviour came from the good performance of the classifiers or whether it was partly because many of these compounds fall into categories that do not need a classifier (they have only one child) and therefore keep high prediction probabilities across the hierarchy. Of the 95 713 compounds in the test set, only 5 255 (5.49%) have at least one category (at the Superclass or Class level, the two levels where this is possible) that needs no classifier to predict the following level. When blocking is applied, the number of compounds in this situation, relative to the total number of compounds that remain after blocking, is presented in Table 3.11.

Table 3.11. The number of compounds with four predicted levels with at least one category (Superclass or Class levels) that does not need any classifier (only has one child).

Threshold  No blocking  0.5    0.6    0.7    0.8    0.9    0.95
n          5255         2328   1743   1342   1031   724    554
Total      95713        74740  69023  64387  60002  55777  52603
%          5.49%        3.11%  2.53%  2.08%  1.72%  1.30%  1.05%

The same analysis was performed on the compounds with only three levels of prediction (without subclass), and the results are presented in Table 3.12.

Table 3.12. The number of compounds with three predicted levels and one superclass which does not need any classifier (only has one child).

Threshold  0.5    0.6    0.7    0.8    0.9    0.95
n          81     55     56     31     13     6
Total      5482   5550   5637   5517   5271   5016
%          1.48%  0.99%  0.99%  0.56%  0.25%  0.12%

In every case these compounds are a small percentage of the total, which indicates that the results genuinely reflect the performance of the classifiers. Additionally, the percentage of these compounds does not increase when the threshold is raised, so having at least one local prediction with a probability of 1 does not seem to favour their overall prediction probability.
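The percentages in Tables 3.11 and 3.12 can be recomputed from the counts with a short sketch. The numbers are copied from the two tables; the variable names are illustrative only.

```python
# Counts copied from Table 3.11 (four predicted levels).
thresholds_four = ["no blocking", 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
n_four = [5255, 2328, 1743, 1342, 1031, 724, 554]
total_four = [95713, 74740, 69023, 64387, 60002, 55777, 52603]

# Counts copied from Table 3.12 (three predicted levels).
thresholds_three = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
n_three = [81, 55, 56, 31, 13, 6]
total_three = [5482, 5550, 5637, 5517, 5271, 5016]

# Share (in %) of compounds that pass through a single-child category.
share_four = [100 * n / t for n, t in zip(n_four, total_four)]
share_three = [100 * n / t for n, t in zip(n_three, total_three)]

for label, share in zip(thresholds_four, share_four):
    print(f"four levels,  threshold {label}: {share:.2f}%")
for label, share in zip(thresholds_three, share_three):
    print(f"three levels, threshold {label}: {share:.2f}%")
```

If single-child categories were inflating the prediction probabilities, their share among the surviving compounds would be expected to grow as the threshold rises; the recomputed shares fall from the most permissive to the strictest threshold in both tables.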

To evaluate the model's performance in the test set, the metrics used to evaluate and select the classifiers (F1-score with macro and micro average) were now computed for the prediction on the test set, separately in all four levels with and without the blocking strategy. These results, along with the number of compounds that remain when blocking is applied, are shown in Table 3.13.

Table 3.13. F1-score results from the top-down prediction approach on the test set and coverage of compounds when the blocking strategy is applied.

Threshold    Kingdom                          Superclass                       Class                            Subclass
             macro   micro   coverage         macro   micro   coverage         macro   micro   coverage         macro   micro   coverage
No blocking  0.9895  0.9998  95713 (100.0%)   0.5483  0.8844  95713 (100.0%)   0.2981  0.7972  95713 (100.0%)   0.2632  0.7461  95713 (100.0%)
0.5          0.9895  0.9998  95713 (100.0%)   0.5830  0.9212  88995 (93.0%)    0.3355  0.9033  80222 (83.8%)    0.3072  0.8967  74740 (78.1%)
0.6          0.9918  0.9998  95708 (100.0%)   0.5986  0.9456  83770 (87.5%)    0.3276  0.9356  74573 (77.9%)    0.2900  0.9313  69023 (72.1%)
0.7          0.9937  0.9999  95701 (100.0%)   0.6197  0.9646  79325 (82.9%)    0.3039  0.9580  70024 (73.2%)    0.2554  0.9551  64387 (67.3%)
0.8          0.9951  0.9999  95691 (100.0%)   0.6391  0.9781  74998 (78.4%)    0.2709  0.9748  65519 (68.5%)    0.2222  0.9744  60002 (62.7%)
0.9          0.9961  0.9999  95679 (100.0%)   0.6261  0.9902  70096 (73.2%)    0.1955  0.9877  61048 (63.8%)    0.1574  0.9875  55777 (58.3%)
0.95         0.9971  0.9999  95665 (99.9%)    0.5509  0.9952  67019 (70.0%)    0.1481  0.9924  57619 (60.2%)    0.1104  0.9923  52603 (55.0%)

In Table 3.13, classification performance decreases with the depth of the hierarchy, as expected, since misclassifications are propagated downwards. The blocking strategy steadily increases the micro-averaged F1-score and, at the lower thresholds, the macro-averaged F1-score as well, which is the desired behaviour and indicates that the probability estimates made by the classifiers are meaningful. Except at the Kingdom level, however, the macro F1-score starts to decrease at some point as the threshold is raised. This happens because some small, harder-to-classify categories run out of true positives (TP): their F1-score becomes 0, and the macro average drops because each of them still counts as a category, even though it has no remaining samples and therefore no weight in the micro average. This is also part of the reason for the generally low macro F1-scores: some small classes cannot be predicted.
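The opposite movement of the two averages can be reproduced with a small, self-contained sketch. The data are invented for illustration, and the F1 computation is a plain re-implementation over a fixed label set (an assumption about how the scores were averaged), in which a category without any TP scores 0.

```python
from collections import Counter


def f1_macro_micro(y_true, y_pred, labels):
    """Return (macro F1, micro F1) over a fixed list of labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    per_class = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # A category with no TP (or no samples left at all) scores 0.
        per_class.append(2 * tp[c] / denom if denom else 0.0)
    macro = sum(per_class) / len(labels)
    all_tp, all_fp, all_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_denom = 2 * all_tp + all_fp + all_fn
    micro = 2 * all_tp / micro_denom if micro_denom else 0.0
    return macro, micro


labels = ["big", "small"]

# Invented example without blocking: 90 "big" and 10 "small" compounds.
true_before = ["big"] * 90 + ["small"] * 10
pred_before = ["big"] * 85 + ["small"] * 5 + ["small"] * 4 + ["big"] * 6

# Invented example after blocking: only confident predictions survive,
# all of them correct "big" compounds; "small" has no samples left.
true_after = ["big"] * 80
pred_after = ["big"] * 80

macro_before, micro_before = f1_macro_micro(true_before, pred_before, labels)
macro_after, micro_after = f1_macro_micro(true_after, pred_after, labels)
print(f"before: macro={macro_before:.3f} micro={micro_before:.3f}")
print(f"after:  macro={macro_after:.3f} micro={micro_after:.3f}")
```

Here blocking raises the micro average, because every surviving prediction is correct, while the macro average falls, because the small category still counts as one of the two labels but contributes an F1-score of 0.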

The MSCC classification system is based on elemental constraints, stoichiometric ratios, and mass constraints.[12] It was reported to have 98.8% accuracy when classifying compounds into 6 categories (Lipids, Peptides, Amino sugars, Carbohydrates, Nucleotides, and Phytochemical compounds). As Table 3.13 shows, the present classification model classifies compounds into 26 categories at the Superclass level with 88.4% accuracy, into 311 categories at the Class level with 79.7% accuracy, and into 724 categories at the Subclass level with 74.6% accuracy. Despite its lower accuracy compared to the MSCC, this classification model is far more descriptive. Additionally, because the classifiers output probability estimates, the confidence of the prediction, and consequently the accuracy, can be increased at the cost of compound coverage. Also, the dataset used for testing this model does not exclude isomers, which are very common and can have different classifications: in this dataset of 290 039 compounds, there are 51 954 different chemical formulas and 79 154 combinations of classification and chemical formula. The existence of isomers makes correct assignment by more descriptive classifiers even more difficult, since isomers might share a broad class but belong to different, more specific ones.
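The isomer counts quoted above (different chemical formulas versus combinations of classification and formula) correspond to a simple grouping, sketched below on invented records. The function name and the placeholder formulas and classification labels are hypothetical; they do not come from the dataset.

```python
from collections import defaultdict


def isomer_summary(records):
    """records: iterable of (chemical_formula, classification) pairs.

    Returns (distinct formulas, distinct formula/classification
    combinations, formulas with more than one classification).
    """
    classes_by_formula = defaultdict(set)
    for formula, classification in records:
        classes_by_formula[formula].add(classification)
    n_formulas = len(classes_by_formula)
    n_combinations = sum(len(c) for c in classes_by_formula.values())
    n_ambiguous = sum(1 for c in classes_by_formula.values() if len(c) > 1)
    return n_formulas, n_combinations, n_ambiguous


# Invented records: isomers share a formula but may differ in classification.
toy_records = [
    ("C6H12O6", "Classification A"),
    ("C6H12O6", "Classification B"),
    ("C6H12O6", "Classification A"),
    ("C2H6O", "Classification C"),
    ("C2H6O", "Classification D"),
    ("CH4", "Classification E"),
]

print(isomer_summary(toy_records))
```

With the figures reported for the full dataset, 79 154 combinations over 51 954 formulas give roughly 1.5 classifications per formula on average, which is why a formula alone cannot determine the category.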

Kingdom

The confusion matrices for the Kingdom level are presented in Figure 3.5 (without blocking and with blocking at the highest threshold, 0.95). Each cell is annotated with the number of compounds and with the percentage of compounds relative to the total number of true labels of that category; on the diagonal, this percentage is the recall of the category. Under class imbalance, recall is more informative than precision because it is computed within each true category and is therefore unaffected by the relative sizes of the categories. A small category might be well classified, but if even a small fraction of a much larger category is mistaken for it, its precision can drop drastically.
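The effect of class imbalance on precision can be seen in a small numerical sketch. The confusion matrix below is invented, and it assumes that rows hold the true labels and columns the predicted labels.

```python
def recall_and_precision(cm):
    """Per-category recall and precision from a square confusion matrix.

    Assumption: cm[i][j] counts compounds of true category i predicted as j.
    """
    n = len(cm)
    recall, precision = [], []
    for i in range(n):
        true_total = sum(cm[i])                        # row sum
        pred_total = sum(cm[r][i] for r in range(n))   # column sum
        recall.append(cm[i][i] / true_total if true_total else 0.0)
        precision.append(cm[i][i] / pred_total if pred_total else 0.0)
    return recall, precision


# Invented two-category example: one large and one small category.
cm = [
    [9900, 100],  # large category: 1% of it is mistaken for the small one
    [1, 19],      # small category: 19 of 20 compounds correctly classified
]

recall, precision = recall_and_precision(cm)
print("recall   :", [round(v, 3) for v in recall])
print("precision:", [round(v, 3) for v in precision])
```

The small category keeps a recall of 0.95, yet its precision falls below 0.2, because the 100 misassigned compounds from the large category outnumber its 19 true positives.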

Figure 3.5. Kingdom level - confusion matrices (no blocking and blocking with a 0.95 probability threshold) for the top-down prediction of the test set.

Regarding the Kingdom level, the scores (Table 3.13) and the confusion matrices show that the prediction is almost perfect, with only a few misclassified compounds, as would be expected from the cross-validation results obtained during tuning.

Superclass

Similarly, the confusion matrices for prediction at the Superclass level are shown in Figure 3.6 and Figure 3.7.

Figure 3.6. Superclass level - confusion matrix for the top-down prediction of the test set (without blocking).

We can see in Figure 3.6 that, without blocking, “Acetylides”, “Allenes”, and “Organic compounds – Unspecified” do not have a single TP, and “Hydrocarbon derivatives”, “Lignans, neolignans and related compounds”, and “Organic 1,3-dipolar compounds” have a recall (TP over the total of true labels) below 20%. This seems to be due to the poor representation of these categories in the dataset, since there are only a few samples of each in the test set (a maximum of 33), except for the “Lignans, neolignans and related compounds” category, which has 143 samples but is probably very heterogeneous. The best-classified superclasses, with a recall of over 70%, are all of the inorganic ones (“Homogeneous metal compounds”, “Homogeneous non-metal compounds”, “Miscellaneous inorganic compounds”, and “Mixed metal/non-metal compounds”), as well as some organic superclasses: “Hydrocarbons”, “Lipids and lipid-like molecules”, “Nucleosides, nucleotides, and analogues”, “Organic Polymers”, “Organic acids and derivatives”, “Organic salts”, and “Phenylpropanoids and polyketides”. The overall accuracy of 88.4% (micro F1-score) is good; however, it is mainly due to the fact that the majority of the compounds belong to a single superclass, “Lipids and lipid-like molecules”, which is well classified.

Figure 3.7. Superclass level - confusion matrix for the top-down prediction of the test set (with blocking - 0.95 threshold).

In Table 3.13, we can see that the blocking strategy with a probability threshold of 0.95 increases accuracy by 11.1 percentage points, reaching a very high share of correctly classified samples (99.5%), at the cost of 30.0% of the compounds, which are left unclassified at this level. When this threshold is applied (Figure 3.7), only 15 of the 26 superclasses have at least one TP, compared to 23 superclasses when no blocking is applied (Figure 3.6). With a probability threshold of 0.5, accuracy increases by 3.7 percentage points (to 92.1%), and only 7.0% of the compounds are left without classification at this level. Additionally, there are still 23 superclasses with at least one TP (confusion matrix not shown). This suggests that a more permissive threshold might offer a better balance between accuracy and data coverage.

Class

For this classification level, the individual category scores (precision and recall) are shown in Figure 3.8 and Figure 3.9. The accuracy is 79.7%. Of the 311 classes, 74 have no TP (23.8%), so their precision and recall are 0; 86 classes have a recall lower than 0.2 (27.7%); and 44 classes have a recall higher than 0.7 (14.1%). Compared to the previous level, more categories do not get predicted, and performance is lower. Inorganic categories again perform better than organic ones: of a total of 31 inorganic classes, 2 do not get predicted (6.5%), 1 has a recall lower than 0.2 (3.2%), and 18 have a recall higher than 0.7 (58.1%).

Figure 3.8. Performance of the classification of each Class (Precision and Recall metrics).

From Table 3.13, we can see that with a 0.5 probability threshold, accuracy increases by 10.6 percentage points (to 90.3%) with a 16.2% loss in the number of compounds, while with a 0.95 probability threshold, accuracy increases by 19.5 percentage points with a 39.8% loss in the number of compounds. Although accuracy is very high with a 0.95 threshold, there is a significant loss in compound coverage, and the same happens with the coverage of classes. Of the 311 classes, 74 do not get predicted without blocking, and this number increases to 121 (0.5 threshold) and 260 (0.95 threshold) when blocking is applied. As at the Superclass level, these results show that the balance between accuracy and data coverage must be considered: the threshold should not be chosen by looking at accuracy alone.

Subclass

Of the 724 subclasses, 192 do not get predicted (26.5%); they are listed in the Supplementary Material (Table 6.19 and Table 6.20). The remaining subclasses have their precision and recall plotted in Figures 3.10 to 3.12. The accuracy is 74.6%, there are 213 subclasses with a recall lower than 0.2 (29.4%), and 86 subclasses with a recall higher than 0.7 (11.9%). These results show that few subclasses have a good recall, and many do not get predicted or have a low recall. Compared to the previous level, the overall performance drops again. Also, of a total of 48 inorganic subclasses, 5 do not get predicted (10.4%), 1 has a recall lower than 0.2 (2.1%), and 27 have a recall higher than 0.7 (56.3%), which shows that inorganic compounds are again better classified than organic compounds.

Regarding the blocking strategy, a 0.5 probability threshold increases accuracy by 15.1 percentage points while losing 21.9% of the compounds, and a 0.95 probability threshold increases accuracy by 24.6 percentage points while losing 45.0% of the compounds. In terms of lost categories, of a total of 724 subclasses, a threshold of 0.5 loses 341 subclasses (47.1%), while a threshold of 0.95 loses 637 subclasses (88.0%).

As in the previous two levels, these results show that the accuracy and data coverage balance must be considered when using this strategy.
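The trade-off at the three lower levels can be recomputed directly from Table 3.13. The sketch below copies the micro F1-score (used here as accuracy) and the coverage percentage for the unblocked case and for the 0.5 and 0.95 thresholds; the dictionary layout and function name are illustrative only.

```python
# (micro F1-score, coverage in %) copied from Table 3.13.
table_3_13 = {
    "Superclass": {"none": (0.8844, 100.0), 0.5: (0.9212, 93.0), 0.95: (0.9952, 70.0)},
    "Class": {"none": (0.7972, 100.0), 0.5: (0.9033, 83.8), 0.95: (0.9924, 60.2)},
    "Subclass": {"none": (0.7461, 100.0), 0.5: (0.8967, 78.1), 0.95: (0.9923, 55.0)},
}


def trade_off(level, threshold):
    """Return (accuracy gain in percentage points, coverage loss in %)."""
    base_acc, base_cov = table_3_13[level]["none"]
    acc, cov = table_3_13[level][threshold]
    gain_pp = round(100 * (acc - base_acc), 1)
    loss_pct = round(base_cov - cov, 1)
    return gain_pp, loss_pct


for level in table_3_13:
    for threshold in (0.5, 0.95):
        gain, loss = trade_off(level, threshold)
        print(f"{level:<10} threshold {threshold}: +{gain} points, -{loss}% of compounds")
```

Reading the gain and the loss side by side for each threshold makes the cost of the stricter setting explicit: at every level, moving from 0.5 to 0.95 adds fewer accuracy points than it removes in coverage.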

Figure 3.11. (continuation) Performance of the classification of each Subclass (Precision and Recall metrics).

From the document Unraveling compound taxonomies in untargeted metabolomics through artificial intelligence (pages 52-64)