Feature Extraction and Prediction Methods for Protein Sequences

Posted on: 2022-05-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X L Ruan
GTID: 1520306335495094
Subject: Information and Communication Engineering
Abstract/Summary:
With the gradual deciphering of the human genome, a large number of protein sequences and whole-genome sequences with unknown functions have emerged. Proteins are an indispensable part of living organisms, and a protein's location in the cell is closely related to its function. A mature protein can perform its biological function normally only when it is transported to the correct subcellular location; otherwise a series of diseases may arise. Accurate and efficient prediction of the function of massive protein sequence data is therefore important for understanding disease mechanisms and for drug development. This doctoral dissertation focuses on the prediction of mitochondrial protein function and membrane protein function. The main research work is as follows:

(1) A mitochondrial protein function prediction method based on local PSSM feature fusion. First, drawing on protein sequence information, an improved pseudo-amino acid position-specific scoring matrix (IM-Psepssm) feature representation is proposed; under the same parameters, it reduces the feature dimension and enriches the location-discriminative information. Second, based on IM-Psepssm, three new feature description methods are proposed to mine the correlation information between different residues, and they are compared with and analyzed against IM-Psepssm. Third, the inherent key information sites of protein sequences are integrated into the position-specific scoring matrix, yielding a new Per PSSM Enhance Composition (P-PSSM-En Com) feature description method. Finally, based on an analysis of the probability distribution of sequence lengths in the benchmark dataset, three strategies are used to describe the continuous and discontinuous evolutionary information of PSSM sub-regions defined by different segmentation points, enriching the key feature information. The effectiveness of the mitochondrial protein function prediction method based on local PSSM feature fusion is verified on different datasets.

(2) A mitochondrial protein function prediction method based on integrated multi-source feature representation. A single feature extraction method, or methods that capture only a single attribute, cannot fully represent all the useful information in a protein sequence. Although multi-feature fusion methods improve the accuracy of protein function prediction, several difficulties remain: first, feature fusion increases the dimension of the feature vector to some extent; second, combining feature description methods that share the same attribute yields poor complementarity among features; third, imbalanced datasets are not handled well; fourth, most methods use a single classifier or a voting ensemble, which does not make full use of the feature attributes of each classification result. To address these problems, the evolutionary information features from part (1) are first combined with sequence-content features and physicochemical-property features, so that the same protein sequence is described at three different levels of complexity. Second, because feature-level fusion ignores the heterogeneity among features, low-dimensional meta-features are constructed with an ensemble learning strategy, and a resampling method is applied to reduce the imbalance between sample categories and to eliminate the influence of fuzzy-boundary data on the classification model. The results indicate that integrating multi-source attribute features improves the accuracy of mitochondrial protein function prediction.

(3) A membrane protein function prediction method based on an optimized capsule neural network. Because the pooling layers of deep learning models lose some protein sequence information, and because such models need large samples for training, a capsule neural network (CapsNet) is applied to predict membrane protein function. This alleviates part of the information loss caused by normalizing PSSMs of unequal length to a fixed feature length. By considering the distribution of the membrane protein data, the propagation process between capsules and the dynamic routing algorithm, the network parameters are optimized and the CapsNet structure is improved; in addition, only the PSSM is used as model input, so that the association features hidden in the PSSM can be mined. Finally, based on the optimized capsule neural network, a hybrid learning framework combining traditional machine learning and deep learning models is constructed. In this framework, 41 benchmark models are trained, the new features obtained after traversal and screening are fused by two-level decision-level fusion, and the fused features are compared and analyzed using three ensemble combining strategies. The experimental results show that on Dataset1, Dataset2 and Dataset3 the model scores 1.52%, 2.26% and 2.67% higher than the best competing algorithm, respectively.

(4) Design and development of the system. To help researchers understand and implement the feature extraction methods in this dissertation, a prototype system is developed based on the feature description and sampling methods from parts (1) and (2). Furthermore, high-quality, low-homology benchmark datasets were collected and constructed from a relatively recent release of the UniProt protein database, and the datasets were updated and extended.
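
For readers unfamiliar with the pseudo-PSSM idea that part (1) builds on, the following is a minimal sketch of the conventional pseudo-PSSM descriptor as it is commonly described in the literature. It is not the IM-Psepssm variant proposed in the dissertation, whose details the abstract does not give. The function names (`normalize_rows`, `pse_pssm`), the per-row standardization, and the `max_lag` parameter are assumptions made for illustration. Given an L x 20 PSSM, the sketch returns the 20 column means followed by, for each lag from 1 to `max_lag`, the mean squared difference between scores that lie `lag` positions apart in the same column, so sequences of different length map to vectors of the same size.

```python
from statistics import mean, pstdev


def normalize_rows(pssm):
    """Standardize each row (one residue position) to zero mean and unit spread.

    A row whose scores are all equal is mapped to zeros to avoid dividing by zero.
    """
    normalized = []
    for row in pssm:
        mu = mean(row)
        sd = pstdev(row)
        normalized.append([(v - mu) / sd if sd > 0 else 0.0 for v in row])
    return normalized


def pse_pssm(pssm, max_lag):
    """Conventional pseudo-PSSM features (illustrative, not IM-Psepssm).

    pssm    : list of rows, one per residue, each with the same number of columns
    max_lag : largest sequence-order lag to include (hypothetical parameter name)

    Returns a flat list of length n_columns * (1 + max_lag).
    """
    n_rows = len(pssm)
    if n_rows <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    n_cols = len(pssm[0])
    norm = normalize_rows(pssm)

    # Composition part: average score of each column over the whole sequence.
    features = [mean(row[j] for row in norm) for j in range(n_cols)]

    # Sequence-order part: mean squared difference between positions `lag` apart.
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            features.append(
                mean((norm[i][j] - norm[i + lag][j]) ** 2 for i in range(n_rows - lag))
            )
    return features
```

With a real 20-column PSSM and `max_lag=2`, the resulting vector would have 20 x (1 + 2) = 60 entries regardless of sequence length, which is what makes descriptors of this kind usable as fixed-length classifier input.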
Keywords/Search Tags:Protein sequence, Protein function localization, Feature extraction, Ensemble learning, Data imbalance