
The Research Of Protein Secondary Structure Prediction Algorithm Based On Decision Forest

Posted on: 2020-05-06; Degree: Master; Type: Thesis
Country: China; Candidate: Y P Li
GTID: 2370330575992718; Subject: Computer application technology
Abstract/Summary:
Proteins are an important component of the human body, and almost all activities in the body require the participation of proteins with specific functions. The spatial structure of a protein largely determines its function, so studying protein structure helps to better understand that function. However, it is not yet possible to obtain the spatial structure directly by simulating the protein folding process. Because a protein is composed of a sequence of amino acids, a common approach is to predict its secondary structure from the amino acid sequence and to use that prediction to understand its three-dimensional conformation. In an era of rapid development in big data, cloud computing, and artificial intelligence, using machine learning to predict protein secondary structure has become a research hotspot in bioinformatics. Based on the decision forest model and machine learning techniques, this thesis studies eight-class protein secondary structure prediction. The main contributions are as follows:

First, for eight-class secondary structure prediction, a decision forest prediction algorithm based on gradient boosting is proposed. Using PSSM (position-specific scoring matrix) profile features of the amino acid sequence, the algorithm takes the second-order Taylor approximation of the cross-entropy loss function as its optimization objective. The mapping function defined by each decision tree serves as the parameter to be optimized, and each tree is built by greedily selecting the best split point over the feature values. In addition, to prevent overfitting, an L2 regularization term is added to the objective function to control model complexity. On the standard CB513 benchmark data set for protein secondary structure prediction, the proposed algorithm achieves a Q8 accuracy of 64.89%.

Second, to address the slow running speed of the gradient boosting decision forest algorithm, a fast gradient boosting prediction model based on histograms is proposed. The model discretizes the sample features with a histogram method, applies gradient-based one-side sampling to reduce the large number of training samples, and uses feature bundling to reduce the dimensionality of the features, achieving parallelism along both the sample and feature dimensions. The factors affecting model performance are analyzed through extensive experiments. The results show that the fast gradient boosting algorithm proposed in this thesis achieves a Q8 accuracy of 66.35% on the test set. In addition, on the same data set, the proposed algorithm runs substantially faster than the other algorithms compared.
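The ingredients named in the abstract (a second-order approximation of the cross-entropy loss, an L2 penalty on leaf weights, greedy split selection, and histogram discretization) can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the equal-width binning, the default `n_bins`, and the value of `lam` are all assumptions made for illustration, and the code handles a single feature and the gradients of a single class only.

```python
import math

def softmax_grad_hess(scores, label):
    """Gradient and diagonal Hessian of the cross-entropy loss
    with respect to the raw per-class scores of one sample."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    grad = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    hess = [p * (1.0 - p) for p in probs]
    return grad, hess

def leaf_weight(G, H, lam):
    """Leaf value minimizing G*w + 0.5*(H + lam)*w**2 (lam is the L2 penalty)."""
    return -G / (H + lam)

def leaf_score(G, H, lam):
    """Twice the loss reduction obtained by a leaf using its optimal weight."""
    return G * G / (H + lam)

def best_histogram_split(values, grads, hesss, n_bins=8, lam=1.0):
    """Greedy split search on one feature using equal-width histogram bins
    (an assumed binning scheme). Returns (threshold, gain); the threshold
    is None when no split has positive gain."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0.0:
        width = 1.0  # constant feature: every sample falls into bin 0
    g_hist = [0.0] * n_bins
    h_hist = [0.0] * n_bins
    # Accumulate gradient statistics per bin instead of per sample.
    for v, g, h in zip(values, grads, hesss):
        b = min(int((v - lo) / width), n_bins - 1)
        g_hist[b] += g
        h_hist[b] += h
    G, H = sum(g_hist), sum(h_hist)
    best_gain, best_threshold = 0.0, None
    GL = HL = 0.0
    # Only bin boundaries are candidate split points.
    for b in range(n_bins - 1):
        GL += g_hist[b]
        HL += h_hist[b]
        GR, HR = G - GL, H - HL
        gain = 0.5 * (leaf_score(GL, HL, lam)
                      + leaf_score(GR, HR, lam)
                      - leaf_score(G, H, lam))
        if gain > best_gain:
            best_gain, best_threshold = gain, lo + (b + 1) * width
    return best_threshold, best_gain
```

For an eight-class problem, one would typically call `softmax_grad_hess` per sample, take the gradient and Hessian entries of one class, and fit one tree per class per boosting round. Scanning bin boundaries rather than every distinct feature value is what makes the histogram variant cheaper than an exact split search.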
Keywords/Search Tags: Protein Secondary Structure Prediction, Sliding Window, Gradient Boosting, Decision Forest
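
The "Sliding Window" keyword is not elaborated in the abstract. A common construction in secondary structure prediction, assumed here rather than taken from the thesis, builds the feature vector of each residue by concatenating the PSSM rows of its neighbors within a fixed window and zero-padding at the sequence ends. The function name and the default window size of 13 are hypothetical.

```python
def sliding_window_features(pssm, window=13):
    """Build one feature vector per residue from a PSSM.

    pssm   : list of rows, one per residue, each a list of scores
             (20 columns for a standard PSSM).
    window : odd number of residues covered by each feature vector
             (13 is an assumed default, not a value from the thesis).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so it can be centered on a residue")
    half = window // 2
    n_cols = len(pssm[0]) if pssm else 0
    pad = [0.0] * n_cols  # zero rows for positions outside the sequence
    features = []
    for i in range(len(pssm)):
        row = []
        for j in range(i - half, i + half + 1):
            row.extend(pssm[j] if 0 <= j < len(pssm) else pad)
        features.append(row)
    return features
```

With a 20-column PSSM and a window of 13, each residue is described by 260 values, which could then be passed to a gradient boosting model as one training sample per residue.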