
Protein Function Prediction And Drug Target Discovery Based On Deep Learning

Posted on: 2021-05-03
Degree: Master
Type: Thesis
Country: China
Candidate: J J Hong
Full Text: PDF
GTID: 2404330626451495
Subject: Medicinal chemistry
Abstract/Summary:
Discovering innovative drug targets is an important and challenging task in the development of new drugs. Studies of protein functional classification give a deeper understanding of the characteristics of target proteins and thereby contribute to target discovery. The premise of protein function classification is to annotate protein functions. With the development of multiple omics technologies, protein sequences have accumulated in large quantities. Traditional experimental methods of annotating protein functions cannot close the widening gap between the numbers of annotated and unannotated proteins. Computational protein function prediction is an effective way to address this problem, and in recent years it has become indispensable in the field of protein function annotation. However, traditional computational methods suffer from disadvantages such as low accuracy and a high false positive rate. Deep learning is one of the most promising artificial intelligence approaches and has achieved great success in medical diagnosis, genomics data analysis, and drug design. Using deep learning to fully extract protein features and to construct accurate and stable functional prediction models may make up for the shortcomings of traditional computational methods. Based on deep learning, this thesis carries out the following research on protein function prediction.

In the first step, a protein function prediction model was constructed based on a convolutional neural network (CNN) combined with a binary protein encoding strategy. Proteins of 20 GO families were collected, and two types of data sets were constructed for each family. The performance of the model was then compared with BLAST, HMMER, support vector machine (SVM), probabilistic neural network (PNN) and k-nearest neighbor (KNN) on the two types of data sets of each family. The first type of data set had the highest similarity between the training dataset and the independent testing dataset. CNN, SVM, PNN and KNN all performed well on this type of data set, and no significant difference between any two methods was observed on any measurement. This result indicates that it is difficult to evaluate the performance of each method based on the first type of data set. The second type of data set had the lowest similarity between the training dataset and the independent testing dataset. On this type, the prediction accuracy of CNN on the 20 GO families was between 66% and 98%, which was better than SVM, PNN and KNN, and the specificity (SP) of CNN was between 87% and 100%, which means that it controlled the false positive rate better than SVM, PNN and KNN. To further evaluate the false positive rate of the CNN model in a real-world setting, proteins encoded by the human genome were collected, and the models of all GO families were used to predict these proteins. The enrichment factor (EF) was then calculated from the prediction results. The EFs of CNN were all above 2 and significantly higher than those of BLAST, SVM, PNN and KNN. Compared with HMMER, CNN performed better on most GO families. Overall, this further shows that the CNN model constructed in this study is good at controlling the false positive rate.

In the second step, the model was applied to the annotation of bacterial type IV secretion system effector proteins (T4SEs). T4SEs play a vital role in the process of bacterial invasion. Studying the molecular mechanism of their action and understanding their characteristics is of great significance for drug target discovery, inhibition of bacterial type IV secretion systems, and bacterial resistance research. All these studies depend on the identification and annotation of T4SEs. However, current T4SE prediction methods have disadvantages such as a high false discovery rate. Therefore, T4SEs and non-T4SEs were collected from previous studies to establish and evaluate new T4SE prediction models. In addition, protein feature representation methods were explored. For each protein feature representation method, a CNN model was established for predicting T4SEs, and each model was evaluated on the independent testing dataset. The models based on three kinds of features (protein secondary structure and solvent accessibility, position-specific scoring matrix, and sequence one-hot encoding) achieved the best performance (accuracies of 95.6%, 98.9%, and 96.7%, respectively), comparable to or even higher than that of Bastion4. Moreover, the results of predicting bacterial genomic proteins show that these three methods also performed best in controlling the false positive rate (EFs of 6.72, 6.84, and 6.44, respectively). To consider the characteristics of proteins more comprehensively and further improve the reliability of the prediction results, the T4SE annotation tool CNN-T4SE was established based on the three methods above. Its false positive rate in predicting non-T4SEs was evaluated and was also the best among the compared methods.

In this thesis, a protein function prediction model and a bacterial type IV secretion system effector protein recognition model were constructed based on convolutional neural networks. Both achieved better performance than BLAST, HMMER, SVM, PNN, KNN, Bastion4, T4SEpre_bpbAac, T4SEpre_Joint and T4SEpre_psAac, and are expected to provide a reference for protein function research. In addition, since the protein function prediction in this thesis is preliminary research for drug target discovery, it can be applied to target protein prediction in future work, contributing to the profiling of drug target characteristics and improving the efficiency of drug target discovery.
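
As a rough illustration of two ideas the abstract relies on, the sketch below shows a sequence one-hot (binary) encoding and the computation of accuracy and specificity from binary predictions. It is a minimal sketch, not the thesis implementation: the residue alphabet and its ordering, the zero-padding to a fixed length, the all-zero row for unknown residues, and the function names `one_hot_encode` and `accuracy_and_specificity` are all assumptions made for illustration.

```python
# Hypothetical helpers for illustration only; not taken from the thesis.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues; ordering is an assumption
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Encode a protein sequence as a max_len x 20 binary matrix (list of lists).

    Assumptions: sequences longer than max_len are truncated, shorter ones are
    padded with all-zero rows, and non-standard residues map to all-zero rows.
    """
    matrix = []
    for residue in sequence.upper()[:max_len]:
        row = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(residue)
        if idx is not None:
            row[idx] = 1
        matrix.append(row)
    while len(matrix) < max_len:
        matrix.append([0] * len(AMINO_ACIDS))
    return matrix


def accuracy_and_specificity(y_true, y_pred):
    """Return (accuracy, specificity) for binary labels, where 1 = positive.

    Specificity = TN / (TN + FP); a higher value means fewer false positives.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    negatives = tn + fp
    specificity = tn / negatives if negatives else 0.0
    return accuracy, specificity
```

A matrix produced by `one_hot_encode` could serve as the input to a CNN, and `accuracy_and_specificity` could be applied to predictions on an independent testing dataset. The enrichment factor is not sketched here because the abstract does not state its formula.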
Keywords/Search Tags: drug target, protein, function prediction, deep learning, convolutional neural network, machine learning, false positive rate, bacterial type IV secretion system, effector protein