A thesis submitted in partial fulfillment for the Master's degree

In Chapter 2, we describe related work and state-of-the-art methods for compositional bias detection, intrinsic disorder prediction, and protein language models. In the following sections, we discuss the main algorithms that have been applied to modeling essential properties of proteins, such as biological properties and intrinsically disordered regions.

Protein Language Models

Self-supervised/Unsupervised protein language models

Moreover, self-supervised learning on large amounts of unlabeled data helps the models to learn. The authors used large protein sequence datasets, such as UniRef50 [26], for self-supervised training.

Detection of intrinsically disordered proteins

Compositional bias detection

The technique produces a masked output sequence that is used for additional analysis, such as database searches, as well as detection of low-complexity regions. The low-complexity region identification is extremely selective for individual residue types, and this method has been shown to be suitable for masking sequences from database queries while avoiding false positives.

Figure 2.2: Overview of fLPS algorithm [2].

Intrinsic disorder prediction

The method is trained simultaneously with long and short IDRs and is insensitive to the adopted training dataset. To deal with the class imbalance problem, the method was trained to maximize the area under the curve (AUC) metric.

Figure 2.3: Example of IUPred predictions [3]. The threshold for classifying a disordered protein is the black line (score >= 0.5).

Metagenomics

Sampling

Filtering is also important for obtaining as much information as possible about the target cells and for removing non-target material to prevent contamination of the sample. Physical separation of the required cells from the samples may also be necessary to ensure a representative DNA extraction. In these cases, amplification of the starting material is necessary, as most sequencing methods require nanograms or micrograms of DNA.

The final step is to maintain sample metadata such as sample location, sampling conditions, date, pH, temperature, marine depth, and altitude in terrestrial samples.

Sequencing technologies

Assembly

Binning

Annotation

In general, annotation is a two-step process, where the key features (genes) are found and then the gene functions and the taxonomic neighbors are assigned (functional annotation). Algorithms that take into account di-codon frequency, bias in codon usage, patterns in the use of start and stop codons and if possible, information on species-specific ribosome binding site patterns, ORF length and GC content of coding sequences are more suitable for gene prediction. It should be pointed out that annotation is not done from scratch, but rather by mapping existing information to gene or protein libraries.

Simulation tools

According to current estimates, only 20 to 50 percent of a metagenomic sequence can be annotated, leaving the pressing question of the remaining genes' relevance and function. The tool reports the quality of the NGS datasets in terms of read length, read quality, repetitive and non-repetitive indel profiles, and single-base-pair substitutions. CAMISIM [49] is a simulation tool created to generate the simulated metagenome datasets used in the initial CAMI challenge.

Many aspects of the created communities and datasets can be customized using CAMISIM, including the total number of genomes (community complexity), strain diversity, community genome abundance distributions, sample sizes, number of replicates, and sequencing technique.

Figure 2.6: Overview of the metagenomic simulation process [6]. The acronyms and abbreviations in the figure are fully described in reference [6].

CAST

In other words, the above equation is the weighted score of all parameters Ca of the 20 amino acids. To reduce the computational cost of the search, dynamic programming is applied to evaluate the above equations. Since the sequence positions within homopolymers are not important, the search technique based on the Smith-Waterman algorithm is simplified by using only one iteration for each search.

The above procedure is applied to every type of compositional bias in order to detect all possible low-complexity regions for each amino acid.
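The single-pass, per-residue search described above can be illustrated with a minimal sketch. It is not the CAST implementation: the flat `match` and `mismatch` scores stand in for a substitution matrix, and the function names and the `threshold` value are assumptions made for illustration. Comparing a sequence against a homopolymer without gaps reduces to finding the highest-scoring contiguous segment, which one linear pass can do.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def best_biased_segment(seq, residue, match=2, mismatch=-1):
    """Return (score, start, end) of the best ungapped segment of `seq`
    when compared against a homopolymer of `residue` (end is exclusive).

    The match/mismatch scores are illustrative assumptions, not the
    substitution-matrix scores used by CAST.
    """
    best = (0, 0, 0)
    running, start = 0, 0
    for i, aa in enumerate(seq):
        running += match if aa == residue else mismatch
        if running <= 0:
            # A non-positive prefix can never help a later segment: restart.
            running, start = 0, i + 1
        elif running > best[0]:
            best = (running, start, i + 1)
    return best


def mask_biased_regions(seq, threshold=10):
    """Mask the best segment of every residue type that reaches `threshold`
    with the undefined residue type X."""
    masked = list(seq)
    for residue in AMINO_ACIDS:
        score, start, end = best_biased_segment(seq, residue)
        if score >= threshold:
            masked[start:end] = "X" * (end - start)
    return "".join(masked)
```

This sketch reports at most one segment per residue type; detecting several regions for the same residue would require repeating the search on the masked sequence.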

SEG

This is equivalent to searching a database of 20 homopolymers to find the biased regions. Finally, CAST has an option to mask biased regions with the undefined residue type X, which can be ignored in further database searches and thus improves the specificity of the search strategy. These complexity states depend only on the numbers $N$, $L$, and $n_i$ and are independent of the residue composition and of the states' probability of occurrence.

Complexity and entropy measures are similar in nature and can both be used to describe the compositional bias of proteins.
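To make the entropy view concrete, the following sketch computes the Shannon entropy of each sliding window of a sequence. It is an illustration only and not the SEG algorithm itself: the window length of 12 and the function names are assumptions, and SEG's complexity measure and two-stage search are not reproduced here.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the residue composition of `window`."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def window_entropies(seq, length=12):
    """Entropy of every window of `length` residues along `seq`.

    Low values indicate compositionally biased (low-complexity) windows.
    """
    return [
        shannon_entropy(seq[i:i + length])
        for i in range(len(seq) - length + 1)
    ]
```

A homopolymeric window has entropy 0, while a window in which all residues differ has the maximum entropy for its length.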

Figure 3.2: An example of SEG predictions.

IUPred

The prediction of long disorder, short disorder and structured domains is optional and requires different parameters for each category. Another potential application of this method is to find putative structured domains suitable for structure determination. Neighboring regions are combined, while regions less than 30 residues in length are omitted.

The region(s) expected to match structured domains are returned when this prediction type is selected.
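The post-processing described above (combining neighboring regions and omitting those shorter than 30 residues) can be sketched as follows. This is a hedged illustration rather than the IUPred code: the 0.5 threshold follows the caption of Figure 2.3, while `max_gap` and the function name are assumptions introduced here.

```python
def ordered_regions(scores, threshold=0.5, max_gap=5, min_length=30):
    """Return (start, end) pairs (end exclusive) of putative structured regions.

    `scores` are per-residue disorder scores; residues below `threshold`
    are treated as ordered. `max_gap` (an assumed parameter) is the largest
    gap between two ordered runs that still lets them be combined.
    """
    # 1. Collect contiguous runs of ordered residues.
    runs, start = [], None
    for i, score in enumerate(scores):
        if score < threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append([start, i])
            start = None
    if start is not None:
        runs.append([start, len(scores)])

    # 2. Combine neighboring runs separated by a short gap.
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] <= max_gap:
            merged[-1][1] = run[1]
        else:
            merged.append(run)

    # 3. Omit regions shorter than the minimum length.
    return [(s, e) for s, e in merged if e - s >= min_length]
```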

IUPred2A

More specifically, the method analyzes the energy profile and finds continuous regions that are reliably predicted to be ordered. P is an energy prediction matrix relating the amino acid composition vector to the energy of a particular residue. Its parameters were tuned to minimize the discrepancy between the energy calculated from the known structures using the statistical potential and the energy estimated from the amino acid sequence.

The energy calculated for each amino acid residue is smoothed over a window of size w0 and translated into a score between 0 and 1, which can be interpreted as the quasi-probability that a given residue is disordered.
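The smoothing step can be illustrated with a simple moving average. This is only a sketch under assumptions: the source does not state the exact filter, so a centered, unweighted window is used here, and `half_window` is a hypothetical parameter name. The subsequent mapping of smoothed energies to a score between 0 and 1 is not reproduced.

```python
def smooth(values, half_window=10):
    """Centered moving average over `values`.

    Each position is replaced by the mean of the values within
    `half_window` positions on either side; the window is truncated
    at the sequence ends.
    """
    n = len(values)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed
```

Smoothing spreads an isolated extreme value over its neighbors, so that a single residue does not dominate the per-residue prediction.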

MobiDB-lite

The total stabilization energy contribution of intrachain interactions to a particular protein structure can be calculated using the sum of these energy components at the residue level. A unique approach was devised to estimate this energy directly from the amino acid sequence without a known structure. High-energy residues are predicted to be ordered, while low-energy ones are predicted to be disordered, based on the above calculation.

MobiDB-lite is intended to be highly specific in this approach, complementing traditional domain-oriented annotation of protein sequences.

ESpritz

If amino acid $j$ is present in the sequence at position $k$, $R_{kj} = 1$; otherwise, $R_{kj} = 0$. We denote as $p_k(x)$ the probability of finding amino acid $x$ at position $k$ along the sequence in a multiple sequence alignment. In the case of 20 amino acids, the profile sequence $p_k(x)$ is multiplied by the input vector as:

GlobPlot

To smooth the curve Ω and obtain a numerical estimate of its first-order derivative, a digital low-pass filter is used. A basic peak-finding method (referred to as PeakFinder) is then used to select putative globular and disordered segments. Peaks are selected when the first derivative has positive (disorder) or negative (globular) values along a continuous stretch of at least the minimum length.
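The smoothing, differentiation, and peak selection steps can be sketched as below. This is a simplified illustration, not the GlobPlot implementation: a plain moving average stands in for the digital low-pass filter, the derivative is a central difference, and `half_window`, `min_length`, and the function names are assumptions.

```python
def moving_average(values, half_window):
    """Centered moving average; the window is truncated at the ends."""
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def first_derivative(curve):
    """Central differences, one-sided at the two ends."""
    n = len(curve)
    if n < 2:
        return [0.0] * n
    deriv = []
    for i in range(n):
        lo, hi = max(0, i - 1), min(n - 1, i + 1)
        deriv.append((curve[hi] - curve[lo]) / (hi - lo))
    return deriv


def peak_finder(curve, half_window=2, min_length=5):
    """Label stretches where the smoothed curve keeps rising (disorder)
    or falling (globular) for at least `min_length` positions."""
    deriv = first_derivative(moving_average(curve, half_window))
    segments, start, sign = [], 0, 0
    for i, d in enumerate(deriv + [0.0]):  # sentinel closes the last run
        current = (d > 0) - (d < 0)
        if current != sign:
            if sign != 0 and i - start >= min_length:
                label = "disorder" if sign > 0 else "globular"
                segments.append((label, start, i))
            start, sign = i, current
    return segments
```

Segment boundaries are returned as half-open intervals, so a segment `(label, s, e)` covers positions `s` to `e - 1`.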

FlDPnn

In this chapter, we will initially describe our proposed method for predicting intrinsically disordered regions based on a transformer network. We then analyze our framework evaluating the predictions of several state-of-the-art IDP methods. Finally, we describe the implemented platform for simulating metagenomes, assembling proteins and identifying disordered regions for the first time.

More specifically, the Transformer uses the self-attention mechanism, which enables the network to construct highly informative representations that include context from several regions in the sequence.
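As an illustration of how each position gathers context from the whole sequence, the sketch below implements single-head scaled dot-product self-attention with plain Python lists. It is a generic textbook formulation, not the thesis implementation; the projection matrices `w_q`, `w_k`, and `w_v` and the row-per-residue layout of `x` are assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax of a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention.

    x: one embedding row per sequence position.
    Returns (output, weights); weights[i][j] is how strongly position i
    attends to position j.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights
```

Each output row is a weighted average of the value vectors of all positions, which is how distant regions of the sequence can contribute to the representation of a residue.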

Figure 3.3: Overview of FlDPnn algorithm [7].

Self-Attention mechanism

Multi-Head Attention Mechanism

Feed Forward Networks

\[ \mathrm{FFN}(X) = \mathrm{ReLU}(W_1 X + b_1) W_2 + b_2 \tag{4.6} \]

In addition, skip (residual) connections are used to mitigate the vanishing gradient problem and allow representations from different levels of processing to interact.
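A minimal sketch of Equation 4.6 combined with a skip connection is given below for a single position. It assumes the row-vector convention (the input is multiplied on the left of each weight matrix) and equal input and output dimensions so that the residual addition is defined; layer normalization is omitted, and all names are illustrative.

```python
def relu(vector):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in vector]


def linear(x, w, b):
    """Affine map x @ w + b for a vector x and a d_in x d_out matrix w."""
    return [
        sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j]
        for j in range(len(b))
    ]


def ffn_with_skip(x, w1, b1, w2, b2):
    """Position-wise feed-forward network followed by a skip connection."""
    hidden = relu(linear(x, w1, b1))
    out = linear(hidden, w2, b2)
    # Skip connection: add the block input back onto its output.
    return [xi + oi for xi, oi in zip(x, out)]
```

Because the input is added back to the output, the gradient has a direct path around the two linear layers, which is the property the skip connection relies on.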

Positional encoding

Classifier

Optimization strategy

Pre-training
Supervised training for IDR

Framework for comprehensive comparison of intrinsic disorder predic-

Data Discretization

Intrinsic disorder prediction on metagenomes

Simulation of metagenomic data

The last option is to generate time-series metagenomic datasets with multiple linked samples. For this, a simulation similar to the Markov model is used, where each sample distribution is determined by the distribution of the previous sample plus an additional log-normal or Gaussian component. Genomic abundance profiles from the community design process are used to create metagenome datasets.

The number of reads produced for each genome-specific taxon $t$ with abundance $ab_t$, where $(t, ab_t) \in P_{out}$, is determined by the size of the genome and the total number of reads in the sample.
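The time-series option described above can be sketched as a chain of abundance profiles in which each sample is the previous one plus a random component. This is not CAMISIM code: only the Gaussian variant is shown, negative abundances are clipped to zero, and `sigma`, `seed`, and the function names are assumptions made for illustration.

```python
import random


def normalize(profile):
    """Rescale abundances so that they sum to 1."""
    total = sum(profile)
    if total <= 0:
        return [1.0 / len(profile)] * len(profile)
    return [p / total for p in profile]


def simulate_time_series(initial, n_samples, sigma=0.1, seed=0):
    """Generate `n_samples` linked abundance profiles.

    Each profile depends only on the previous one (Markov-like): a
    Gaussian component is added to every abundance, negative values are
    clipped to zero, and the profile is renormalized.
    """
    rng = random.Random(seed)
    profiles = [normalize(initial)]
    for _ in range(n_samples - 1):
        previous = profiles[-1]
        perturbed = [max(p + rng.gauss(0.0, sigma), 0.0) for p in previous]
        profiles.append(normalize(perturbed))
    return profiles
```

Fixing the seed makes the simulated series reproducible, which is convenient when comparing predictors on the same simulated communities.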

Protein assembly

Then we sort the array by the k-mer value to find the rows that contain each k-mer, and each row is aligned with the center of the set. The resulting array is divided into groups based on the center sequence, and the center sequence is aligned to each group member (mN alignments) without gaps. In the next step, each center sequence is iteratively extended by the group member with the highest sequence similarity, subject to a default minimum similarity threshold of 90%, until no further extensions are available.

The output proteins are then fed to the IDP predictors mentioned in Section 3 to predict the ratio of intrinsically disordered regions.

Figure 4.7: Workflow of Plass assembler [9].

Metrics

In this section, we will evaluate our proposed method for IDR prediction on several datasets. Next, we will use our proposed framework to compare several IDP methods and validate their predictions. Finally, we will run the IDP methods on simulated metagenomic data with different simulation parameters.

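The Matthews Correlation Coefficient (MCC) measures the quality of the confusion matrix of a binary classifier and is defined as:

\[ \mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \]

where $TP$, $TN$, $FP$, and $FN$ are the numbers of true positives, true negatives, false positives, and false negatives, respectively.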
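For reference, MCC can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definition; returning 0 when the denominator vanishes is a common convention adopted here as an assumption, and the function name is illustrative.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Returns a value in [-1, 1]; 0.0 is returned when any marginal sum
    is zero and the coefficient is otherwise undefined.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

In residue-level disorder prediction, the counts are taken over residues, with disordered residues as the positive class.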

Datasets

Implementation details

Evaluation

In Table 5.3, we compare our method with state-of-the-art methods on the Disorder723 dataset. On this dataset, AUCpred is the best-performing method, achieving an MCC score of 0.564, showing that protein representation contains useful information for IDP. Again, CAST and SEG perform worse, as they are compositional bias detection methods and are not designed for the prediction of intrinsic disorder.

Finally, Table 5.4 compares our method with several state-of-the-art methods on the Critical Assessment of protein Intrinsic Disorder prediction (CAID) dataset, which consists of proteins from the DisProt dataset.

Table 5.3: Comparison of state-of-the-art methods on the Disorder723 dataset

Comparison of IDP methods

In Table 5.5, we compare the IDP predictors using the raw input data of the MXD494 dataset. IUPred-short and ESpritz-Xray have the best performance, with MCC scores of 0.769 and 0.741, respectively, and their predicted regions match better with the regions of the other methods. In Table 5.6, we compare the IDP predictors using the post-discretization transformed data of the MXD494 dataset.

IUPred-short and ESpritz-Xray have the best predictions of the disordered regions, which match better with the regions of the other methods.

Table 5.5: Results on comparison of IDP methods with unnormalized data on MXD494.

Intrinsic disorder prediction on metagenomic data

Rost, "Modeling aspects of the language of life through transfer-learning protein sequences," BMC Bioinformatics, vol.
Xu, "Aucpred: prediction of protein disorder at the proteome level by auc-maximized deep folded neural fields," Bioinformatics, vol.
Esnouf, "Ronn: neural bio-basis function network technique applied to detection of naturally disordered regions in proteins," Bioinformatics, vol.
"Improved sequence-based prediction of disordered regions with multilayer fusion of multiple information sources," Bioinformatics, vol.

Table 5.8: Results on comparison of IDP methods with normalized data on Disorder723
