Protein Remote Homology Detection Based On Deep Learning

Posted on: 2019-07-22
Degree: Master
Type: Thesis
Country: China
Candidate: S M Li
GTID: 2370330566998660
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of next-generation sequencing techniques, the number of known protein sequences is growing exponentially. However, because of the limits of current knowledge and the high cost of experimental identification, information on protein structure and function accumulates slowly. Predicting the structures and functions of proteins from sequence information is therefore one of the central problems in bioinformatics. Two proteins are remote homologs when they share low sequence similarity but have similar structures and functions. Protein remote homology detection makes it possible to draw a first inference about the structure and function of an unknown protein; its main goal is to assign unknown proteins to protein superfamilies. The performance of traditional machine-learning methods depends heavily on the quality of the protein feature vectors, and much important information is lost during vectorization. One advantage of deep learning techniques is that they can extract features automatically from raw data. The aim of this study is therefore to apply deep learning networks to protein remote homology detection, so that protein features with greater discriminative power are extracted automatically.

First, we propose ULSTM, a protein remote homology detection method based on Long Short-Term Memory (LSTM). By using every intermediate hidden value together with a time-distributed dense layer, ULSTM can better handle long protein sequences and fuse dependency information. ULSTM performs well on the benchmark dataset: ULSTM-One Hot achieves a mean ROC score of 0.965 and a mean ROC50 score of 0.794, and ULSTM-PSSM achieves a mean ROC score of 0.985 and a mean ROC50 score of 0.925. Both ULSTM-One Hot and ULSTM-PSSM outperform other related methods, which indicates that the structure designed for ULSTM allows it to extract more discriminative features from protein sequences.

Second, we propose an improved model for protein remote homology detection, BLSTM. Its performance benefits from the more comprehensive information contained in the hidden values of the bidirectional layer. BLSTM-One Hot achieves a mean ROC score of 0.965 and a mean ROC50 score of 0.810; BLSTM-PSSM achieves a mean ROC score of 0.986 and a mean ROC50 score of 0.923.

Third, we propose CNN-BLSTM, a method for protein remote homology detection that combines a convolutional neural network with LSTM. CNN-BLSTM first identifies “important” subsequences and then extracts the dependencies among them. Combined with PSSM, CNN-BLSTM-PSSM achieves the best performance on the benchmark dataset among all of the deep learning models (mean ROC score: 0.984, mean ROC50 score: 0.938). Visualization techniques are also used to illustrate the ability of CNN-BLSTM to identify protein patterns.

In addition, the protein features extracted by ULSTM, BLSTM, and CNN-BLSTM are aggregated to improve prediction performance. To address the lack of training samples in real-world applications, this study uses a framework that aggregates deep learning models with ranking methods. The experiments show that aggregating CNN-BLSTM-PSSM with the ranking method HHblits achieves state-of-the-art performance (ROC score: 0.998, ROC50 score: 0.981), which suggests a more practical and highly sensitive method for protein remote homology detection.

In summary, this study applies deep learning techniques to protein remote homology detection and achieves good performance. Aggregating the deep models improves performance further. Finally, a framework that combines ranking methods with deep learning models allows the approach to be used widely in settings without enough training samples.
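The abstract reports ROC and ROC50 scores without defining them. As a minimal sketch, assuming the definition commonly used in homology-detection benchmarks (the area under the ROC curve up to the first 50 false positives, normalized to the range 0 to 1), the score for one ranked list could be computed as below. The function name `roc_n`, the tie handling, and the toy data are illustrative assumptions; the thesis's exact evaluation procedure is not given in the abstract.

```python
from typing import Sequence


def roc_n(scores: Sequence[float], labels: Sequence[int], n: int = 50) -> float:
    """Normalized area under the ROC curve up to the first n false positives.

    scores: higher means "more likely a true homolog" (assumed convention).
    labels: 1 for a true homolog, 0 otherwise.
    """
    total_pos = sum(1 for y in labels if y == 1)
    total_neg = len(labels) - total_pos
    if total_pos == 0 or total_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    # If there are fewer than n negatives, use all of them; the result is
    # then the ordinary ROC area (assuming no tied scores).
    n_eff = min(n, total_neg)

    # Rank from highest to lowest score. Ties keep their input order
    # (an assumption of this sketch, not a rule from the thesis).
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    true_pos_seen = 0
    false_pos_seen = 0
    area = 0
    for _, label in ranked:
        if label == 1:
            true_pos_seen += 1
        else:
            false_pos_seen += 1
            # Each false positive contributes the number of true
            # positives ranked above it.
            area += true_pos_seen
            if false_pos_seen == n_eff:
                break

    return area / (n_eff * total_pos)


# Toy example with made-up scores: three homologs, two non-homologs.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 0, 1, 1, 0]
toy_roc50 = roc_n(toy_scores, toy_labels, n=50)  # 4 / (2 * 3)
```

A "mean ROC50 score" such as those quoted above would then be the average of this value over the benchmark's test sets, under the same assumption about the definition.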
Keywords/Search Tags: protein remote homology detection, deep learning, long short-term memory, convolutional neural network