| Thermal stability is one of the fundamental properties of proteins and other biomolecules, and the study of protein thermal stability has significant implications for protein engineering, biotechnology, and industrial design. Protein thermal stability can be assessed in two ways: by classifying proteins as thermophilic or non-thermophilic, and by measuring the protein’s melting temperature. This thesis uses computational methods to construct prediction models for both tasks from the sequence information of multi-species proteins, and then predicts the effects of single-point mutations on protein thermal stability. The contents are as follows: (1) A deep learning model for protein thermophilicity classification, named DeepTP and based on feature fusion, is proposed to address the insufficient use of multi-dimensional features in existing methods. First, 20978 proteins from 2427 species with optimal growth temperatures were extracted from published databases and preprocessed to construct the dataset. Then, 6 groups of biochemical features were calculated from the protein sequences, while convolutional neural networks combined with bidirectional long short-term memory networks and a self-attention mechanism were employed to extract sequence information. These extracted features were fused with the biochemical features to build the final prediction model. DeepTP achieves better performance on both the balanced test set and the validation set, with AUCs of 0.964 and 0.764, respectively. On the unbalanced test set, DeepTP achieves an average precision of 0.646. (2) A deep learning model, ProTstab2, for predicting protein melting temperature based on the self-attention mechanism is proposed to address the performance problems caused by small training datasets and insufficient use of protein sequence information in reported models. First, the melting temperature data for 13 species including
human, mouse, zebrafish, Drosophila melanogaster, etc. were extracted from published literature and then preprocessed to construct a larger and more comprehensive dataset. Then, biochemical features were calculated from the protein sequences, and recursive feature elimination with cross-validation was used for feature selection, yielding 464 selected features. Simultaneously, a Word2vec model was used to pre-encode the protein sequences, and a combination of a convolutional neural network and a self-attention mechanism was used for feature extraction. Finally, the prediction model was trained on these two types of features. ProTstab2 achieves higher prediction accuracy both in 10-fold cross-validation and on the test set. (3) Furthermore, a deep learning model, PON-Tm, based on feature processing with the pre-trained protein language model ESM, is proposed to predict the impact of single-point mutations on protein thermal stability. This model aims to address the low prediction accuracy caused by insufficient use of contextual sequence information at mutation sites in existing models. First, a single-point mutation thermal stability dataset covering 108 species, including human, mouse, maize, E. coli, etc., was extracted from the ProTherm and MPTherm databases and preprocessed. Then, the pre-trained ESM model was used to pre-encode the contextual sequence information at the mutation site, and feature extraction was performed with the self-attention mechanism. Finally, the extracted features, along with evolutionary information and the melting temperature predicted by ProTstab2, were used as inputs to construct a regression model. Compared with other methods, PON-Tm achieved higher prediction accuracy and lower prediction error, with MAEs of 3.798 °C and 3.590 °C in 10-fold cross-validation and on the test set, respectively. This thesis is based entirely on the sequence information of proteins and uses deep learning methods to construct three prediction models
related to protein thermal stability in multiple species. These models achieved high prediction accuracy and generalization ability and can thus provide important references for the thermal stability modification of proteins. Moreover, they are valuable in enzyme engineering, biomedicine, and other research fields. |
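
As a rough illustration of the sequence-only workflow summarized above, the sketch below computes one simple sequence-derived feature vector (amino acid composition) and the mean absolute error (MAE), the metric used to report the PON-Tm results. The function names, the choice of amino acid composition as a stand-in for a biochemical feature group, and the example values are all assumptions made for illustration; the abstract does not specify the feature sets at this level of detail, and this is not the thesis's implementation.

```python
from collections import Counter

# The 20 standard amino acids in a fixed order, so that feature vectors
# computed for different proteins are directly comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequence):
    """Return the fraction of each standard amino acid in `sequence`.

    Hypothetical helper: a simple example of a sequence-derived feature,
    not necessarily one of the feature groups used in the thesis.
    """
    seq = sequence.upper()
    # Count only standard residues; ambiguous symbols such as 'X' are ignored.
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def mean_absolute_error(y_true, y_pred):
    """Mean absolute error between measured and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Made-up toy sequence and temperatures, for illustration only.
    features = aa_composition("AAGG")
    print(len(features))                                      # 20
    print(mean_absolute_error([50.0, 60.0], [52.0, 57.0]))    # 2.5
```

In a pipeline of the kind described, a vector like this would be one of several inputs combined with learned sequence representations before training; the MAE function shows how the reported error in °C is defined.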