
Research On Features Extraction Methods For Prediction Of Protein Structural Classes And Subcellular Localization

Posted on: 2018-03-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Y Liang
GTID: 1360330542993496
Subject: Applied Mathematics
Abstract/Summary:
Prediction of protein structural classes and of protein subcellular localization plays a crucial role in predicting protein structure and function. They are two core topics of bioinformatics research in the 21st century and two typical pattern recognition problems of proteomics research in the post-genomic era. This thesis constructs a multiclass classification prediction model based on support vector machine (SVM) theory, together with a complete system for evaluating prediction performance. Two more effective feature extraction methods are proposed for each of the two problems, protein structural class prediction and apoptosis protein subcellular localization prediction, and an SVM is used as the classifier. The main contributions are summarized as follows:

1. We study protein structural class prediction for low-similarity datasets. A feature extraction method is proposed that fuses global and local features based on evolutionary information represented as a position-specific scoring matrix (PSSM). Global features are extracted from a consensus sequence derived from the PSSM: the amino acid at each position of this sequence is the amino acid type with the highest score in the corresponding row of the PSSM. The global features comprise amino acid composition features and composition moment features of the consensus sequence. Local features are extracted from PSSM segments of equal length and comprise pseudo-PSSM features and auto-covariance features computed over all segments. To reduce the influence of feature redundancy on SVM prediction performance, principal component analysis is used to reduce the dimensionality of the extracted features. The proposed method is a novel method for low-similarity datasets that relies solely on evolutionary information. The experimental results show that it further improves prediction accuracy and is an important supplement to other PSSM-based prediction methods.

2. To predict protein structural classes for two large low-similarity datasets, a feature extraction method based on multiple information fusion is proposed, using predicted secondary structure sequences (PSSS) and the PSSM. For the PSSS-based features, we add the occurrence frequencies of the 2-words EH and HE in the reduced secondary structure sequence to the existing typical features, and we calculate a normalized Lempel-Ziv (LZ) complexity for the secondary structure sequence. For the PSSM-based features, we obtain 3600 high-dimensional positive features with the auto-cross correlation function. To reduce redundancy and computational complexity, we transform these features with the nonnegative matrix factorization algorithm to reduce their dimensionality. The experimental results verify that this method clearly raises the prediction accuracy of protein structural classes, especially for the α+β class.

3. We study the prediction of apoptosis protein subcellular localization. A statistical feature extraction method is proposed that uses detrended cross-correlation coefficients over non-overlapping windows of the PSSM. The detrended cross-correlation coefficient is a novel measure of the level of cross-correlation between two non-stationary time series, and any two different columns of the PSSM generated from an apoptosis protein sequence can be viewed as two such series. After analyzing and discussing how to select the order of the fitting polynomial and the optimal length s of the equal, non-overlapping windows, we calculate the detrended cross-correlation coefficients for every pair of different PSSM columns and use them as features to predict subcellular localization. The experimental results show that this is a first, successful application of the new statistical method to a pattern recognition problem.

4. For the prediction of apoptosis protein subcellular localization, a feature extraction method based on the fusion of multiple kinds of statistical information from the PSSM is proposed. After studying how to select the lag parameter of the Geary correlation factor and the length s+1 of the equal, overlapping windows, we fuse the sequence-order information of the Geary autocorrelation with the detrended cross-correlation coefficients of overlapping windows, both computed from the PSSM, as features for predicting subcellular localization. The experimental results on three benchmark datasets show that this method improves the prediction accuracy of apoptosis protein subcellular localization and is a more comprehensive and effective statistical feature extraction method.
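
As a rough illustration of the detrended cross-correlation coefficient features in contributions 3 and 4, the following Python sketch computes the coefficient for every pair of PSSM columns over equal, non-overlapping windows. It is a minimal sketch under stated assumptions, not the thesis's implementation: it fixes the fitting polynomial to order 1 (a straight line), treats the PSSM as a list of rows with 20 scores each, and all function names are hypothetical. The abstract does not give the selected polynomial order or window length, so `s` is left as a parameter.

```python
from itertools import combinations
from math import sqrt


def profile(series):
    """Integrated profile: cumulative sum of the mean-centred series."""
    mean = sum(series) / len(series)
    out, total = [], 0.0
    for value in series:
        total += value - mean
        out.append(total)
    return out


def linear_residuals(window):
    """Residuals of a least-squares straight-line fit to one window.

    Assumption: polynomial order 1; the thesis studies the order as a parameter.
    """
    n = len(window)
    t_mean = (n - 1) / 2
    w_mean = sum(window) / n
    s_tt = sum((t - t_mean) ** 2 for t in range(n))
    s_tw = sum((t - t_mean) * (w - w_mean) for t, w in enumerate(window))
    slope = s_tw / s_tt
    return [w - (w_mean + slope * (t - t_mean)) for t, w in enumerate(window)]


def dcca_coefficient(x, y, s):
    """Detrended cross-correlation coefficient of x and y for window length s.

    Uses equal, non-overlapping windows; trailing points that do not fill a
    whole window are ignored (an assumption of this sketch).
    """
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    if s < 3 or len(x) < s:
        raise ValueError("need 3 <= s <= len(series)")
    px, py = profile(x), profile(y)
    f_xy = f_xx = f_yy = 0.0
    for start in range(0, len(x) - s + 1, s):
        rx = linear_residuals(px[start:start + s])
        ry = linear_residuals(py[start:start + s])
        f_xy += sum(a * b for a, b in zip(rx, ry))  # detrended covariance
        f_xx += sum(a * a for a in rx)              # detrended variance of x
        f_yy += sum(b * b for b in ry)              # detrended variance of y
    denom = sqrt(f_xx * f_yy)
    # A column with no detrended fluctuation gives no information: return 0.
    return f_xy / denom if denom > 0 else 0.0


def pssm_dcca_features(pssm, s):
    """One coefficient per unordered pair of different PSSM columns."""
    columns = list(zip(*pssm))  # each column is treated as one series
    return [dcca_coefficient(columns[i], columns[j], s)
            for i, j in combinations(range(len(columns)), 2)]
```

With 20 columns this sketch yields 190 coefficients per protein, each between -1 and 1, which could then be passed to an SVM classifier. The overlapping-window variant of contribution 4 would slide windows of length s+1 one position at a time instead of stepping by `s`.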
Keywords/Search Tags: Protein Structural Classes, Protein Subcellular Localization, Support Vector Machine, Feature Extraction, Position-Specific Scoring Matrix, Predicted Secondary Structure Sequence, Auto-Cross Correlation Function, Detrended Cross-Correlation Coefficient