Low-similarity Protein Structural Class Prediction Based On Multiple Features

Posted on: 2019-04-29
Degree: Master
Type: Thesis
Country: China
Candidate: X J Zhu
Full Text: PDF
GTID: 2310330563954129
Subject: Biology
Abstract/Summary:
The successful completion of the Human Genome Project led to a dramatic increase in the number of known nucleic acid and protein sequences, but the functions of most of these proteins remain unknown. It is therefore important to extract useful functional information from protein sequences. A protein's sequence determines its structure, which in turn determines its function. However, traditional experimental methods cannot keep up with the analysis of such a large number of protein sequences, so machine learning algorithms have been developed in bioinformatics to study the spatial structure and function of proteins. Protein structural classes reflect the secondary and tertiary structures of proteins and are closely related to protein function. This work therefore takes the protein structural class as its research object and uses machine learning methods to study the spatial structure of proteins. Its main contents are as follows.

First, a reliable and rigorous benchmark dataset was constructed. It contains 399 protein sequences with a sequence similarity of about 15%. Second, the tripeptide composition, the position-specific scoring matrix (PSSM), predicted secondary structure information, and averaged chemical shifts were used to characterize the protein sequences. For the high-dimensional tripeptide composition, the binomial distribution and incremental feature selection were used to select the optimal tripeptide features and avoid overfitting, yielding an optimal set of 1254 tripeptide features. Subsequently, prediction models for the four individual feature types and 11 feature combinations were built using support vector machines.

The jackknife cross-validation results showed that, among the four individual features, the optimal tripeptide composition performed best, with an overall accuracy of up to 91% and an average accuracy of 90.5%. After feature fusion, 5 of the combined feature sets reached accuracies above 95%, and 3 feature combinations exceeded 90%. The maximum accuracy (96.4%) was achieved by combining the optimal tripeptide composition with the chemical shift. Of the four features considered in this work, the PSSM performed worst. A comparison with existing approaches showed that the proposed method is superior to other published methods. In addition, using the feature set of the best-performing support vector machine model, this work also compared other classification algorithms, including J48, Naive Bayes, and artificial neural networks, among others. The results indicated that the support vector machine model has some advantage over the other algorithms. The method can thus serve as a reliable tool for accurately predicting the protein structural class of low-similarity sequences.
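As an illustration of the tripeptide composition feature mentioned in the abstract, the following is a minimal Python sketch and not the author's implementation. It assumes the common definition in which each of the 8000 possible tripeptides over the 20 standard amino acids is counted with a sliding window of length 3 and normalized by the number of valid windows. The names `AMINO_ACIDS`, `TRIPEPTIDES`, `INDEX`, and `tripeptide_composition` are hypothetical, and skipping windows that contain non-standard residues is an assumption; the abstract does not specify either detail.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# All 20^3 = 8000 possible tripeptides, in a fixed order.
TRIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]
INDEX = {tri: i for i, tri in enumerate(TRIPEPTIDES)}


def tripeptide_composition(sequence):
    """Return an 8000-dimensional tripeptide frequency vector.

    Assumption: windows containing a non-standard residue (e.g. 'X')
    are skipped, and frequencies are normalized by the number of
    valid windows. A sequence with no valid window yields all zeros.
    """
    counts = [0] * len(TRIPEPTIDES)
    seq = sequence.upper()
    for i in range(len(seq) - 2):
        idx = INDEX.get(seq[i:i + 3])
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


if __name__ == "__main__":
    vec = tripeptide_composition("ACDAC")
    # Print only the tripeptides that actually occur in the sequence.
    for tri, freq in zip(TRIPEPTIDES, vec):
        if freq > 0:
            print(tri, round(freq, 3))
```

A feature selection step such as the binomial-distribution ranking and incremental feature selection described in the abstract would then operate on these 8000-dimensional vectors to keep a smaller subset; that step is not shown here because the abstract does not give its exact formulation.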
Keywords/Search Tags:protein structural class, feature extraction method, feature fusion, low-similarity, machine learning algorithm