Results - Rotation prediction by H-bond pattern matching

7.2 Rotation prediction by H-bond pattern matching

7.2.2 Results

The dataset used for the analysis was a collection of 8182 proteins of known geometric structures taken from the Protein Data Bank [20], containing approximately 1.16 million hydrogen bonds. The dataset was processed in the way described in section 5.6.

Range of d                     Run 1    Run 2    Run 3    Total
0 ≤ d < 0.2664 (0.1%)          60.41    58.55    59.44    59.47
0.2664 ≤ d < 0.4567 (0.5%)     80.30    78.89    79.55    79.58
0.4567 ≤ d < 0.5766 (1.0%)     86.61    85.75    86.08    86.15
0.5766 ≤ d < 0.7862 (2.5%)     92.45    92.42    92.26    92.38
0.7862 ≤ d < 0.9968 (5.0%)     95.32    95.13    95.29    95.24
0.9968 ≤ d < 1.2689 (10.0%)    97.23    96.82    97.14    97.06
1.2689 ≤ d < 1.7663 (25.0%)    98.73    98.56    98.71    98.67
1.7663 ≤ d < 2.3099 (50.0%)    99.72    99.62    99.65    99.66
2.3099 ≤ d < 2.7437 (75.0%)    99.89    99.91    99.87    99.89
2.7437 ≤ d < 3.1416 (100%)     100.00   100.00   100.00   100.00
Number of bonds                25612    25623    23998    75233

Table 16: Accumulative % of hydrogen bonds whose predicted rotation values lie within the specified distance from the true rotation values. The numbers in parentheses show the volume of the ball whose radius is the upper limit of the range, as a proportion of the volume of the entire SO(3), w.r.t. the invariant metric on SO(3).

A subset of 200 proteins was randomly selected from the collection, and the training procedure was performed on the remaining 7982 proteins. The H-bond pattern identification was performed with the following combinations of parameters:

• Window size: 0 (only the central bond), 1, 2, ..., 10

• Remote bonds: 0 (do not indicate remote bonds), 1, 2, ..., window size

• Twisted bonds: none (-1), only the central bond (0), window size

• Number of residue classes: 1, 2, 3, 4

This resulted in 792 parameter combinations. For each of these parameter combinations, the H-bond patterns with fewer than 30 occurrences were discarded.
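The count of 792 can be checked by enumerating the grid directly. The sketch below is only an illustration: the function name and the string labels for the twisted-bond settings are our own, and we assume that the three twisted-bond settings are counted as distinct for every window size (including window size 0), which is the reading that reproduces the quoted total.

```python
from itertools import product


def parameter_grid(max_window=10, residue_class_counts=(1, 2, 3, 4)):
    """Enumerate (window, remote, twisted, residue classes) combinations.

    Hypothetical helper: the labels "none", "central" and "window" stand
    for the twisted-bond settings -1, 0 and "window size" in the list above.
    """
    twisted_options = ("none", "central", "window")
    grid = []
    for window in range(max_window + 1):
        remote_options = range(window + 1)  # 0, 1, ..., window size
        for remote, twisted, n_classes in product(
            remote_options, twisted_options, residue_class_counts
        ):
            grid.append((window, remote, twisted, n_classes))
    return grid


grid = parameter_grid()
print(len(grid))  # 66 (window, remote) pairs x 3 x 4 = 792
```

The factor 66 is the number of (window size, remote bonds) pairs, namely 1 + 2 + ... + 11.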

For each of the remaining H-bond patterns, together with the associated SO(3) rotations, we performed a clustering analysis (see [73] for a description of the clustering algorithm) and computed a score for each of them.

Next, we applied the prediction procedure outlined above to the remaining 200 proteins. We assessed the quality of each prediction by measuring the distance between the predicted and the actual rotation. The analysis was repeated three times with different choices of 200 proteins to be used for prediction. In the following, unless otherwise specified, the accuracy figures quoted are averages over three runs. The results are shown in table 16.
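The radii and volume proportions listed in table 16 can be related by a short calculation. The sketch below assumes that the distance d is the rotation angle of the relative rotation between the predicted and true rotations, which is consistent with the maximum value of π in the table; under the normalised invariant measure the ball of radius r then has volume (r − sin r)/π. The function names are ours and are not taken from the software used for the analysis.

```python
import math


def rotation_distance(r1, r2):
    """Angle in [0, pi] of the relative rotation between two 3x3 matrices."""
    # trace(r1^T r2) equals the sum of elementwise products.
    trace = sum(r1[i][j] * r2[i][j] for i in range(3) for j in range(3))
    cos_angle = (trace - 1.0) / 2.0
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos_angle)))


def ball_volume_fraction(radius):
    """Proportion of SO(3) lying within `radius` of a given rotation."""
    radius = max(0.0, min(math.pi, radius))
    return (radius - math.sin(radius)) / math.pi


def radius_for_fraction(fraction, tol=1e-10):
    """Invert ball_volume_fraction by bisection (it is increasing in radius)."""
    lo, hi = 0.0, math.pi
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if ball_volume_fraction(mid) < fraction:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


for fraction in (0.001, 0.005, 0.01, 0.025, 0.05, 0.10, 0.25, 0.50, 0.75):
    print(f"{100 * fraction:5.1f}%  radius {radius_for_fraction(fraction):.4f}")
```

Under this assumption the printed radii agree with the upper limits of the ranges in table 16 to within about 0.001.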

We see that in 59.47% of all cases the predicted rotation lies within a ball, centred at the true rotation, comprising just 0.1% of the total volume of SO(3), and that in 86.15% of all cases the prediction lies within a ball corresponding to 1% of the volume of SO(3).

Since the PDB is continuously growing, we updated our dataset in 2017 to obtain an expanded dataset with 13115 proteins, containing approximately 1.89 million hydrogen bonds. The results from this expanded dataset were very similar, with 85.99% of the predictions lying inside a ball corresponding to 1% of the volume of SO(3) (table 17). We have therefore chosen to perform the following analysis using the smaller dataset, given the extra time required to analyse the larger one.

Range of d                     Dataset 1    Dataset 2
0 ≤ d < 0.2664 (0.1%)          59.47        59.26
0.2664 ≤ d < 0.4567 (0.5%)     79.58        79.52
0.4567 ≤ d < 0.5766 (1.0%)     86.15        85.99
0.5766 ≤ d < 0.7862 (2.5%)     92.38        92.29
0.7862 ≤ d < 0.9968 (5.0%)     95.24        95.17
0.9968 ≤ d < 1.2689 (10.0%)    97.06        97.16
1.2689 ≤ d < 1.7663 (25.0%)    98.67        98.69
1.7663 ≤ d < 2.3099 (50.0%)    99.66        99.66
2.3099 ≤ d < 2.7437 (75.0%)    99.89        99.85
2.7437 ≤ d < 3.1416 (100%)     100.00       100.00

Table 17: Accumulative % of hydrogen bonds whose predicted rotation values lie within the specified distance from the true rotation values; comparison between the old (Dataset 1) and the new (Dataset 2) datasets.

DSSP class                   DSSP symbol    Local structure
α-helix                      H              Helix
3₁₀-helix                    G              Helix
π-helix                      I              Helix
Strand                       E              Sheet
Isolated β-bridge residue    B              Coil
Turn                         T              Coil
Bend                         S              Coil
Unclassified                 -              Unclassified

Table 18: DSSP classes and corresponding local structure patterns.

To analyse the prediction results further, we looked at the DSSP classes [51] of four residues around each hydrogen bond. DSSP assigns one of seven secondary structure classes (or “unclassified”) to each residue (table 18). The four residues around a hydrogen bond are chosen and recorded in the following order:

• The residue preceding the N-donor amino acid residue.

• The N-donor amino acid residue.

• The O-acceptor amino acid residue.

• The residue following the O-acceptor amino acid residue.

We analysed the frequencies of four-tuples of DSSP classes and the associated prediction accuracies (figure 62). The frequencies are concentrated on a few class combinations, which is also evident from the ten most frequent DSSP class combinations (table 19). We also observe that the residue class combination for the sheet structure (“EEEE”) has a high frequency but relatively low accuracy.

(a) Accuracy of predictions by the DSSP classes of four residues around the hydrogen bond; % of predictions lying within 1% SO(3) volume.

(b) Frequency by the DSSP classes of four residues around the hydrogen bond.

Figure 62: Accuracy and frequency of predictions by DSSP classes. The horizontal and vertical axes show the donor-side


Residues around H-bond    Frequency    Accuracy
HHHH                      9296         99.83%
EEEE                      5604         79.59%
HH-H                      780          99.36%
HTHH                      694          97.84%
T--T                      405          80.74%
TTHH                      390          96.15%
T-HH                      339          88.79%
GG-G                      290          93.45%
TS-T                      281          87.54%
H-HH                      263          91.25%

Table 19: The 10 most frequent combinations of DSSP classes around hydrogen bonds, together with the proportions of predictions that lie inside a ball, centred at the true rotation, with a volume corresponding to 1% of the total volume of SO(3).
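A tabulation of the kind shown in table 19 could be produced along the following lines. This is a minimal sketch on made-up records, not the code used for the analysis: the function name, the record layout (a four-character DSSP string paired with the distance d) and the toy values are all assumptions, and the threshold 0.5766 is the radius of the 1% ball from table 16.

```python
from collections import defaultdict

ONE_PERCENT_RADIUS = 0.5766  # radius of the ball with 1% of SO(3) volume (table 16)


def accuracy_by_dssp(records, radius=ONE_PERCENT_RADIUS):
    """Return (four-tuple, frequency, accuracy %) rows, most frequent first.

    `records` is an iterable of (dssp_four_tuple, distance) pairs, where
    distance is the distance between predicted and true rotations.
    """
    counts = defaultdict(lambda: [0, 0])  # four-tuple -> [total, within radius]
    for four_tuple, distance in records:
        entry = counts[four_tuple]
        entry[0] += 1
        if distance < radius:
            entry[1] += 1
    rows = [
        (four_tuple, total, 100.0 * hits / total)
        for four_tuple, (total, hits) in counts.items()
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Hypothetical toy data, for illustration only.
records = [
    ("HHHH", 0.10), ("HHHH", 0.30), ("HHHH", 0.70),
    ("EEEE", 0.20), ("EEEE", 0.90),
    ("T--T", 0.40),
]
rows = accuracy_by_dssp(records)
for four_tuple, frequency, accuracy in rows:
    print(f"{four_tuple}  {frequency:5d}  {accuracy:6.2f}%")
```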

We then analysed each prediction and looked at the parameters used to produce it. By looking at the distance between predicted and true rotations (∆) and the mean distance to the cluster mode (m) for each prediction, we found a group of predictions made using small window sizes, many of which had large ∆ values (figure C.1). This could happen, for example, if the H-bond pattern matched with a smaller window size has an associated cluster with a well-defined “peak” (and thus a low m value and a high score), while the cluster for a match with a larger window size has a lower “peak” (and a high m value). To encourage the use of larger patterns for prediction, we modified the score function (59) to penalise the use of smaller patterns. The new score function is given by

\[
\text{score} =
\begin{cases}
\pi - m - \exp(3 - w) & \text{if there is only one cluster,} \\
-1 & \text{if there is more than one cluster,}
\end{cases}
\tag{60}
\]

where m is the mean distance to the cluster mode, and w = max{3, window size}. Using the new score function, we achieved a prediction accuracy of 89.05% inside 1% SO(3) volume (table 20). Analyses of the other parameters (Remote bonds, Twisted bonds, and Residue groups) did not show any similar anomalies, and applying similar modifications to the score function to prioritise matches with more detailed patterns did not result in an improvement.
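To make the effect of the penalty term concrete, the score function (60) can be sketched as follows. We read w as max{3, window size} and m as the mean distance to the cluster mode; the function and argument names are our own, and raising an error for a pattern with no cluster is an assumption, since (60) only covers the cases of one cluster and more than one cluster.

```python
import math


def new_score(mean_distance_to_mode, window_size, n_clusters):
    """Sketch of score function (60) with the small-window penalty."""
    if n_clusters < 1:
        # Not covered by (60); rejecting it is an assumption of this sketch.
        raise ValueError("at least one cluster is required")
    if n_clusters > 1:
        return -1.0
    w = max(3, window_size)
    return math.pi - mean_distance_to_mode - math.exp(3 - w)


# The penalty exp(3 - w) equals 1 for window sizes 0 to 3 and decays
# rapidly for larger windows.
for window_size in (0, 3, 4, 6, 10):
    print(window_size, round(new_score(0.5, window_size, 1), 4))
```

With equal m, a match found with window size 10 therefore scores almost exactly 1 higher than a match found with window size 3 or less, so a sharply peaked cluster from a small pattern no longer automatically outranks a slightly broader cluster from a larger pattern.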

We have also run the analysis using local H-bond patterns that include the tertiary bond information, where a tertiary bond counts as one backbone bond in computing the window size, and we do not consider atoms more than one tertiary bond away from the central bond. In other words, an atom is only included in the H-bond pattern if its distance (along the backbone and the tertiary bond edges) to either end of the central bond is less than or equal to the window size, and no more than one tertiary bond is traversed in computing the distance. This criterion was applied to limit the pattern to the atoms most likely to exert some influence on the central H-bond. We found that 82.01% of all predictions lay within 1% SO(3)-volume of the true values (table 20).
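The inclusion criterion for atoms described above amounts to a bounded breadth-first search in which the number of tertiary bonds traversed is tracked alongside the distance. The sketch below illustrates this on a toy graph; the graph representation, the function names and the atom labels are hypothetical and do not reflect the data structures used in the actual analysis.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (atom, atom, is_tertiary) edges."""
    adjacency = defaultdict(list)
    for a, b, is_tertiary in edges:
        adjacency[a].append((b, is_tertiary))
        adjacency[b].append((a, is_tertiary))
    return dict(adjacency)


def atoms_in_pattern(adjacency, central_bond, window_size, max_tertiary=1):
    """Atoms within `window_size` edges of either end of the central bond,
    using paths that traverse at most `max_tertiary` tertiary bonds."""
    # State = (atom, tertiary bonds used so far); BFS gives the shortest
    # distance for each state because every edge has unit length.
    best = {}
    queue = deque()
    for atom in central_bond:
        best[(atom, 0)] = 0
        queue.append((atom, 0))
    while queue:
        atom, used = queue.popleft()
        distance = best[(atom, used)]
        if distance == window_size:
            continue  # neighbours would exceed the window
        for neighbour, is_tertiary in adjacency.get(atom, ()):
            new_used = used + (1 if is_tertiary else 0)
            if new_used > max_tertiary:
                continue
            state = (neighbour, new_used)
            if state not in best:
                best[state] = distance + 1
                queue.append(state)
    return {atom for atom, _ in best}


# Hypothetical toy graph: central bond (A, B); B-T1 and T2-U are tertiary bonds.
edges = [
    ("A", "A1", False), ("A1", "A2", False), ("A2", "A3", False),
    ("B", "T1", True), ("T1", "T2", False), ("T2", "U", True),
]
adjacency = build_adjacency(edges)
print(sorted(atoms_in_pattern(adjacency, ("A", "B"), window_size=2)))
```

In the toy graph, atom U is never included, whatever the window size, because reaching it from the central bond requires traversing two tertiary bonds.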

Range of d                     Reference    New score    Tertiary
0 ≤ d < 0.2664 (0.1%)          59.47        64.75        53.16
0.2664 ≤ d < 0.4567 (0.5%)     79.58        83.47        74.51
0.4567 ≤ d < 0.5766 (1.0%)     86.15        89.05        82.01
0.5766 ≤ d < 0.7862 (2.5%)     92.38        94.02        89.70
0.7862 ≤ d < 0.9968 (5.0%)     95.24        96.24        93.60
0.9968 ≤ d < 1.2689 (10.0%)    97.06        97.64        96.16
1.2689 ≤ d < 1.7663 (25.0%)    98.67        98.99        98.16
1.7663 ≤ d < 2.3099 (50.0%)    99.66        99.77        99.69
2.3099 ≤ d < 2.7437 (75.0%)    99.89        99.93        99.95
2.7437 ≤ d < 3.1416 (100%)     100.00       100.00       100.00

Table 20: Accumulative % of hydrogen bonds whose predicted rotation values lie within the specified distance from the true rotation values, for the reference data (table 16), with the new score function (60), and with tertiary bonds.
