Peptide/Protein Sequence Feature Extraction And Its Application

Posted on: 2013-02-05
Degree: Master
Type: Thesis
Country: China
Candidate: M X Su
Full Text: PDF
GTID: 2210330374970962
Subject: Bioinformatics
Abstract/Summary:
Experimental determination of the biological activity of antimicrobial peptides (AMPs) and of protein phosphorylation sites is time-consuming, laborious, costly, and technically difficult. It is therefore important to use quantitative sequence-activity models (QSAM) to study the relationship between AMP sequences and their biological activity, and to establish methods that automatically predict protein phosphorylation sites from existing data. Such work can guide the design and synthesis of peptide drugs and research on protein phosphorylation.
Feature extraction and modeling are the key steps in both AMP QSAM and phosphorylation site prediction. The amino acid sequence of a peptide or protein determines its structure and function, but higher-order structure is very difficult to obtain, so the features used to predict structure and function are extracted directly from the amino acid sequence. The support vector machine (SVM), which is based on statistical learning theory and structural risk minimization, draws together many ideas from the machine learning field. SVM comprises support vector classification (SVC) and support vector regression (SVR). Because SVM training has high time complexity when the training set is large, in such cases we used another classifier, relaxed variable kernel density estimation (RVKDE), instead of SVM.
This thesis reviews a variety of sequence-based feature extraction methods for peptides and proteins, then proposes several new feature extraction algorithms and applies them, with SVM or RVKDE, to AMP QSAM and to protein phosphorylation site prediction. The results are as follows.
Antimicrobial peptide QSAM modeling.
The thesis reports three new feature extraction methods that integrate the information in peptide or protein primary structure: geostatistics-amino acid 531 properties (GS-AA531), multi-scale component and correlation (MSCC), and a combination of GS-AA531 and multi-scale component (GS-AA531-MSC). These methods are simple to compute, rely only on the amino acid sequence, apply to peptides of different lengths, and capture sequence context. The new methods and other reference methods were applied, together with feature screening, to build QSAM models for two AMP systems (one of equal-length peptides and one of unequal-length peptides). The accuracies of fitting, leave-one-out cross-validation, and external-sample prediction for the models based on GS-AA531 and GS-AA531-MSC improved significantly over those based on the other methods. GS-AA531 and GS-AA531-MSC are therefore promising for broad application in peptide and protein QSAM studies.
Protein phosphorylation site prediction. Protein phosphorylation, an extremely important post-translational modification, participates in almost all life activities. The thesis proposes two new peptide/protein sequence feature extraction methods: the statistical difference table (SDT) and the combined method MSCC-SDT, which integrates MSCC and SDT. The three methods MSCC, SDT, and MSCC-SDT were used with RVKDE or SVC for protein phosphorylation site prediction. On the classic Phospho.ELM data set, performance ranked MSCC-SDT > MSCC > SDT. Compared with several kinase-independent online predictors (AutoMotif Server (AMS), NetPhos, DISPHOS, PHOSIDA, and Scansite), MSCC-SDT outperformed all of them, MSCC outperformed most of them, and SDT outperformed some of them.
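The abstract does not define how SDT is built, so the following is only a minimal sketch under an assumed reading: fixed-width windows are cut around candidate S/T/Y residues, and a table records, for each window position and amino acid, the difference in relative frequency between windows around known phosphorylated sites and windows around non-phosphorylated sites. All function names, the `flank` parameter, and the `X` padding symbol are hypothetical choices for illustration, not the thesis's implementation.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "X"  # assumed placeholder for positions beyond the sequence ends


def extract_windows(sequence, targets="STY", flank=4):
    """Return (position, window) pairs centred on each candidate residue."""
    padded = PAD * flank + sequence + PAD * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue in targets:
            # In `padded`, residue i sits at index i + flank, so this
            # slice holds `flank` residues on each side of it.
            windows.append((i, padded[i:i + 2 * flank + 1]))
    return windows


def position_frequencies(windows, width):
    """Per-position relative frequency of each amino acid over the windows."""
    counts = [Counter() for _ in range(width)]
    for window in windows:
        for pos, residue in enumerate(window):
            counts[pos][residue] += 1
    total = max(len(windows), 1)
    return [{aa: c[aa] / total for aa in AMINO_ACIDS} for c in counts]


def difference_table(positive, negative, width):
    """Positive-minus-negative frequency for each (position, amino acid)."""
    pos_freq = position_frequencies(positive, width)
    neg_freq = position_frequencies(negative, width)
    return [
        {aa: pos_freq[p][aa] - neg_freq[p][aa] for aa in AMINO_ACIDS}
        for p in range(width)
    ]


def encode(window, table):
    """Encode a window as one table score per position (0.0 for padding)."""
    return [table[p].get(aa, 0.0) for p, aa in enumerate(window)]
```

Under these assumptions, each candidate site becomes a numeric vector of length `2 * flank + 1` that could be passed to a classifier such as SVC. The table should be built from training windows only, so that test sites do not leak into their own features.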
The combined method MSCC-SDT, which integrates internal sequence features (MSCC) with external characteristics (SDT), is therefore better suited to protein phosphorylation site prediction.
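For the AMP QSAM part of the abstract, the idea of a geostatistics-based descriptor that works on peptides of different lengths can be illustrated with a small sketch. The abstract gives no formula for GS-AA531, so this example assumes the geostatistical statistic is the semivariogram, gamma(h) = sum over i of (z_i - z_{i+h})^2, divided by 2(N - h), computed over a per-residue property profile z. It uses a single property scale (Kyte-Doolittle hydropathy) as a stand-in for the 531 properties; the function names and `max_lag` are hypothetical.

```python
# Kyte-Doolittle hydropathy values, used here as one stand-in property scale.
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def semivariogram(values, lag):
    """Half the mean squared difference between values `lag` positions apart."""
    n = len(values) - lag
    if n <= 0:
        return 0.0  # sequence too short for this lag
    return sum((values[i] - values[i + lag]) ** 2 for i in range(n)) / (2 * n)


def gs_descriptor(sequence, scale=KD_HYDROPATHY, max_lag=3):
    """Map a peptide of any length to `max_lag` semivariogram values."""
    values = [scale[aa] for aa in sequence]
    return [semivariogram(values, lag) for lag in range(1, max_lag + 1)]


# Peptides of different lengths yield descriptors of the same length.
short = gs_descriptor("KWKLFKKI")
longer = gs_descriptor("GIGKFLHSAKKFGKAFVGEIMNS")
print(len(short), len(longer))
```

Repeating this over many property scales and concatenating the results would give a fixed-length feature vector per peptide. Such a vector could then be screened and passed to a regression model such as SVR, with leave-one-out cross-validation as the abstract describes.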
Keywords/Search Tags: Peptides/Proteins, Sequence Feature Extraction, Support Vector Machine (SVM), Protein Phosphorylation, Quantitative Sequence-Activity Model