
The Intelligent Identification Of The Protein Thermostability

Posted on: 2017-04-28
Degree: Master
Type: Thesis
Country: China
Candidate: X M Gao
GTID: 2271330488482277
Subject: Computer Science and Technology
Abstract/Summary:
High protein thermostability has significant application value, with theoretical and practical importance in fermentation engineering, food, brewing, medicine, environmental protection and metal smelting. Numerous investigations have sought to understand the mechanisms that influence the stability of thermophilic proteins. By analyzing protein sequences and structures and by applying protein engineering methods, these studies identified amino acid composition, dipeptide composition, hydrogen bonding, salt bridges and the hydrophobic effect as contributing factors. Intelligent algorithms play an important role in research on protein thermostability and are now widely used to identify it; examples include LogitBoost, Support Vector Machine, Neural Network, Decision Tree, Bayesian methods, Random Forest and K-Nearest Neighbor. Recently, complex network theory has gradually become an effective way to study thermostability mechanisms: researchers encode the three-dimensional structure of a protein as a residue interaction network and analyze the thermostability mechanism from a system-level view. In this work, machine learning algorithms and complex network theory are combined to analyze protein thermostability.

First, two protein datasets were constructed. The first was derived from the PGTdb and PDB databases. Organisms were classified as either thermophiles or mesophiles according to the optimal growth temperature of the microbe, and this dataset is used to discriminate thermophilic proteins from mesophilic proteins. The second was derived from the ProTherm and PDB databases and is annotated with protein melting temperatures; it is used to predict the protein melting temperature.

Based on the first dataset, the amino acid composition and dipeptide composition were calculated and used as feature vectors.
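The two sequence-derived feature sets mentioned above can be illustrated with a minimal sketch. It assumes the usual definitions (fraction of each of the 20 standard residues, and fraction of each of the 400 ordered adjacent residue pairs); the abstract does not give the exact formulas, and the function names here are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def _standard_residues(sequence):
    # Assumption: non-standard symbols (X, B, gaps, ...) are simply dropped.
    return "".join(aa for aa in sequence.upper() if aa in AMINO_ACIDS)


def amino_acid_composition(sequence):
    """Return 20 fractions: count of each residue / number of residues."""
    seq = _standard_residues(sequence)
    if not seq:
        raise ValueError("sequence contains no standard amino acids")
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return 400 fractions: count of each adjacent pair / number of pairs."""
    seq = _standard_residues(sequence)
    if len(seq) < 2:
        raise ValueError("need at least two standard amino acids")
    counts = {}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        counts[pair] = counts.get(pair, 0) + 1
    n_pairs = len(seq) - 1
    return [counts.get(a + b, 0) / n_pairs
            for a, b in product(AMINO_ACIDS, repeat=2)]


def feature_vector(sequence):
    """Concatenate both compositions into one 420-dimensional vector."""
    return amino_acid_composition(sequence) + dipeptide_composition(sequence)
```

For example, `feature_vector("MKVLAAGIVG")` would yield a list of 420 numbers in which the first 20 entries and the last 400 entries each sum to 1.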
In recognition based on sequence characteristics, the recognition accuracy is determined by the 20 kinds of amino acids that make up proteins. To obtain the contribution of each amino acid, we deleted the composition of one type of amino acid from the feature matrix at a time and used the remaining features as the input vector to discriminate thermophilic from mesophilic proteins. The resulting prediction rates indicate which amino acids are important for protein thermostability, and we analyzed their effect on thermostability. The removals of Arg, Leu, Val and Lys are particularly notable: they reduce the prediction accuracy by at least 3%. These residues have a higher tendency to participate in salt bridges, the hydrophobic effect and hydrogen bonds, which stabilize protein structure, so prediction performance changes more when they are absent.

In addition, based on the first dataset, a residue interaction network was constructed from the three-dimensional structure of each protein with a cut-off radius of 6.5 Å. The following network topology parameters were calculated as feature vectors: the average connection strength, the average degree, the characteristic path length, the clustering coefficient, the weighted clustering coefficient, the closeness centrality and the residue centrality. The average discrimination accuracy of SVM under five-fold cross-validation increased to 87.50%; 89.71% of mesophilic proteins and 85.29% of thermophilic proteins were classified correctly. When we added one type of network topology property to the feature matrix at a time, we found that the characteristic path length and the closeness centrality greatly improved the discrimination rate for thermophilic proteins.
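A rough sketch of how a residue interaction network and some of its topology parameters might be computed is given here. It assumes one representative 3D point per residue (for instance the C-alpha atom, which the abstract does not specify) and an unweighted graph, so the weighted quantities (average connection strength, weighted clustering coefficient) and the residue centrality are left out. All function names are hypothetical.

```python
from collections import deque
from math import dist

CUTOFF = 6.5  # angstroms, the cut-off radius quoted in the abstract


def build_network(coords, cutoff=CUTOFF):
    """Adjacency sets: residues i and j are linked if their points lie within cutoff."""
    n = len(coords)
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if dist(coords[i], coords[j]) <= cutoff:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def bfs_distances(adj, source):
    """Hop counts from source to every node reachable from it."""
    seen = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen[v] = seen[u] + 1
                queue.append(v)
    return seen


def topology_features(adj):
    """Average degree, clustering, characteristic path length, mean closeness."""
    n = len(adj)
    clustering = []
    for nb in adj:
        k = len(nb)
        if k < 2:
            clustering.append(0.0)
            continue
        # Count links among the neighbours of this node.
        links = sum(1 for u in nb for v in nb if u < v and v in adj[u])
        clustering.append(2 * links / (k * (k - 1)))
    path_sum, path_count, closeness = 0, 0, []
    for s in range(n):
        d = bfs_distances(adj, s)
        total, reach = sum(d.values()), len(d) - 1
        path_sum += total
        path_count += reach
        # Closeness over reachable nodes only (assumption for disconnected graphs).
        closeness.append(reach / total if total else 0.0)
    return {
        "average_degree": sum(len(nb) for nb in adj) / n,
        "clustering_coefficient": sum(clustering) / n,
        "characteristic_path_length": path_sum / path_count if path_count else 0.0,
        "mean_closeness_centrality": sum(closeness) / n,
    }
```

Calling `topology_features(build_network(coords))` on a list of `(x, y, z)` tuples returns a small dictionary that could be appended to a sequence-based feature vector before training a classifier.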
The main reason that the characteristic path length and closeness centrality are informative is that thermophilic proteins have more rigid, highly stable structures with strong interactions between residues, which causes them to have a shorter characteristic path length and closeness centrality.

To predict the protein melting temperature directly from the sequence and structure of a protein, a prediction method based on a swarm intelligence algorithm was proposed using the second dataset. A hybrid algorithm combining the artificial bee colony and particle swarm algorithms was used to optimize the parameters of a multivariate linear regression model, and the protein melting temperature was calculated from the amino acid composition. Adding amino acid network topological properties to the amino acid composition greatly improved the prediction accuracy. Compared with the result using amino acid composition alone, for mesophilic proteins the correlation coefficient between the predicted value deviation and the real value deviation increased to 0.71 and the average prediction accuracy increased to 88%; for thermophilic proteins the correlation coefficient increased to 0.75 and the average prediction accuracy increased to 91%.

In this study, we used residue network topology parameters as feature vectors to improve the performance of a Support Vector Machine trained to discriminate thermostable proteins from mesophilic proteins. The protein melting temperature is a vital descriptor of protein thermostability, as it determines the unfolding state of the protein. Predicting the protein melting temperature therefore has important theoretical and practical significance.
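To make the idea of fitting a multivariate linear regression with a swarm algorithm concrete, the sketch below uses plain particle swarm optimization only. This is a simplification, not the hybrid artificial bee colony and particle swarm method of the thesis; the hyperparameter values and function names are assumptions chosen for illustration.

```python
import random


def predict(weights, x):
    """Linear model: weights[0] is the intercept, the rest multiply the features."""
    return weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))


def mse(weights, X, y):
    """Mean squared error of the linear model on the samples (X, y)."""
    return sum((predict(weights, x) - t) ** 2 for x, t in zip(X, y)) / len(y)


def pso_fit(X, y, n_particles=30, n_iter=200, seed=0,
            inertia=0.7, c_personal=1.5, c_global=1.5):
    """Search for regression weights with a basic particle swarm.

    Returns the best weights found and the best error after each iteration.
    """
    rng = random.Random(seed)
    dim = len(X[0]) + 1  # one weight per feature plus the intercept
    pos = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_val = [mse(p, X, y) for p in pos]
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]
    history = [g_val]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                # Pull each particle toward its own best and the swarm's best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * rng.random() * (best_pos[i][d] - pos[i][d])
                             + c_global * rng.random() * (g_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = mse(pos[i], X, y)
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
        history.append(g_val)
    return g_pos, history
```

In a setting like the one described, each row of `X` would hold the features of one protein (amino acid composition, optionally extended with network topology parameters) and `y` the measured melting temperatures; `pso_fit(X, y)` then returns candidate regression weights and a non-increasing error history.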
Keywords/Search Tags: protein thermostability, amino acid interaction network, Support Vector Machine, multivariate linear regression, protein melting temperature