Usando modelos de regressão logística e decomposição em

Background

Conclusions

Usando modelos de regressão logística e decomposição em

decomposição em valores singulares para a seleção de atributos

importantes para classificação de sequências protéicas

by using logistic regression models and singular

value decomposition

Braulio RGM Couto

, Marcelo M Santoro

, Amjad Ali

, Marcos A Santos

Programa de Doutorado em Bioinformática, Universidade Federal de Minas Gerais,

UFMG, Belo Horizonte, Minas Gerais, Brasil

Departamento de Ciências Exatas e Tecnologia, Centro Universitário de Belo

Horizonte, UNI-BH, Belo Horizonte, Minas Gerais, Brasil

Departamento de Bioquímica e Imunologia, UFMG, Belo Horizonte, Minas Gerais,

Brasil

Laboratory of Molecular and Cellular Genetics (LGCM), Departamento de Biologia

Geral, ICB/UFMG, UFMG, Belo Horizonte, Minas Gerais, Brasil

Departamento de Ciência da Computação, UFMG, Av. Antonio Carlos 6627, Belo

Horizonte, Minas Gerais, 31270-010, Brasil

*These authors contributed equally to this work

Corresponding author

Email addresses:

BRGMC: braulio.couto@unibh.br

MMS: santoro@icb.ufmg.br

AA: amjad_uni@yahoo.com

MAS: marcos@dcc.ufmg.br

Abstract

Searching for relevant patterns in protein sequences is a critical Bioinformatics goal.

In this work we will present a computational tool to support genomic research that

uses logistic regression models and singular value decomposition to feature selection

and protein sequence classification. Firstly, we consider a biomolecular sequence as a

complex written language that is recoded as p-peptide frequency vector using all

possible overlapping p-peptides window. With 20 amino acids it generates a 20

high-

dimensional vector, where p is the word-size. Each vector row is the peptide that is

analyzed by logistic regression to feature selection for the protein sequence

classification. If we use a word-size window (p=1) one of the features analyzed, the

amino acids are important for a group of proteins. With p=2 we can identify

bipeptides associated with a specific sequences group. Besides peptides we include

sequence length as another feature candidate. The model-building strategy for the

feature selection was an automatic forward stepwise logistic regression. After the

feature selection step, proteins are recoded again only by the p-peptides selected as

important for each sequences group. The rank of the protein frequency matrix

produced for each target group is reduced by singular value decomposition (SVD) and

the results are used to classify unknown sequences. A database with 516,081

sequences from the Swiss-Prot section of the Universal Protein Resource (UniProt)

was the protein collection used in all analysis. We tested the method in seven target

groups: insulin, globin, keratin, cytochrome and proteins related with cystic fibrosis,

Alzheimer disease and schizophrenia. A case-control study was done to examine each

target group. In this approach, sequences from the target group (the cases) are selected

from database for comparison with series of random sequences where the protein is

and restricted, much smaller than the number of controls. In order to try an optimal

allocation of cases and controls during each feature selection analysis, we used a 1:4

case: control ratio. The ratio of four random controls to each case (4:1) compensates

few numbers of cases, being enough to detect features related to each protein group.

Combined method was able to identify the amino acids and bipeptides important to

each protein group. Sensitivity to classify unknown sequences using the SVD system

based on the initial matrix with 400 rows, ranged from 76% for proteins related with

Alzheimer disease and more than 90% for other six groups. All specificities were over

90% for all proteins. After frequency matrix reconstruction using only bipeptides

identified by the logistic regression, decomposition by SVD and subsequent rank

reduction, query retrieval has a sensitivity ranging from 74% for cytochrome to more

than 90% for globin, keratin and proteins related to cystic fibrosis and schizophrenia.

As for the initial matrix, all specificities in this situation were over 90% for all

proteins.

In addition to the feature selection, combining logistic regression models with

singular value decomposition method allows better classification of unknown

sequences than using SVD alone. Matrices used by the combined method are much

smaller than the original one, which leads to optimized oracles. The tool is perfectly

scalable and adaptable to huge problems because it is independent of reference

database size and much less from the length of involved sequences.

Background

Searching for relevant patterns in protein sequences is a critical Bioinformatics goal.

Detected patterns can be used to classify unknown sequences predicting genes family

or can be used to describe evolving relationship of sequences group, instead of being

associated with the function or structural proteins stability [1]. Here we present a

computational tool to support genomic research using a new method based on logistic

regression models to feature of protein sequence classification selection.

Firstly, we consider a bio molecular sequence as a written language that is recoded as

p-peptide frequency vector using all possible overlapping p-peptides window. The