Construction of a Prediction Model for Protein-RNA Interaction Using Deep Learning Methods

Posted on: 2018-05-20
Degree: Master
Type: Thesis
Country: China
Candidate: K Y Zhang
GTID: 2310330518465283
Subject: Bioinformatics
Abstract/Summary:
Objective: Protein-RNA interaction (PRI) is a kind of biological macromolecular interaction that is closely related to many biological processes, such as the regulation of gene expression. For example, the interaction between the bacterial sRNA (small regulatory RNA) CsrB and its protein target CsrA can regulate carbon uptake, cell motility, biofilm formation, quorum sensing and bacterial pathogenicity. In eukaryotes, many ncRNAs (non-coding RNAs) perform a variety of functions by binding to proteins. It is therefore important to develop a PRI prediction model with excellent performance, which will provide bioinformatics support for experimental studies of PRI. Current prediction methods for PRI fall into four classes: RNA-binding protein residue prediction, protein-binding small RNA fragment prediction, sequence-based PRI prediction, and binding site-based PRI prediction. Models in the first class can predict which residues in a protein sequence bind RNA, but they cannot predict which RNA molecule the protein interacts with. Models in the second class reveal the RNA domain taking part in PRI, but not the protein partner. Models in the third class can predict whether a given protein interacts with a given RNA, but they cannot determine the details of the binding sites. Models in the fourth class can give the binding sites, but they also produce a high false positive rate. The four classes therefore emphasize different aspects of PRI. After reviewing these models systematically, we set out to develop new models in the third class, i.e., sequence-based PRI prediction models. Such models can not only predict whether a given protein interacts with a given RNA, but also provide input for models in the fourth class, which can reduce false positives and improve prediction efficiency. Currently, traditional machine learning methods are often applied to develop models for PRI prediction. To improve the performance of such models, one has to understand in depth which features are closely associated with PRI. It is also difficult to determine the optimal weights for the selected features. Furthermore, these models overfit the training dataset easily, i.e., the features and their weights fit the training set well but cannot ensure the same performance on the test dataset. To overcome these shortcomings, we tried to develop new PRI models using deep learning methods. To the best of our knowledge, the application of deep learning to PRI prediction models has not been reported before.

Methods: To build the PRI prediction models, we first constructed the training set and the test set. We downloaded 1370 protein-RNA complexes from the PDB database with resolution less than 5.0 Å (as of February 6, 2017). The complexes were screened by length (>30), redundancy (<50%) and similarity (<70%), yielding 3761 PRI pairs that comprise 1432 protein fragments and 765 RNA fragments. We took these as the positive training dataset. The negative training dataset was constructed as follows. Protein and RNA fragments were randomly selected from the complex data above, candidate pairs with high similarity (>70%) to the positive samples were removed, and the remaining pairs were taken as negative samples. The number of negative samples is about 10 times the number of positive samples. When developing the models, random sampling is used to draw a negative set with the same number of samples as the positive set in the training dataset. Besides the datasets above, three other public datasets, RPI2241, RPI369 and RPI12737, were tested. RPI2241 contains 2241 PRIs extracted from the PRIDB database. RPI369 contains 369 PRIs and is a subset of RPI2241 with the protein-rRNA complexes removed. RPI12737 is composed of 12737 experimentally confirmed PRIs extracted from the NPInter V2.0 database. For each PRI, sequence-based and secondary structure-based features were extracted and then transformed using a restricted Boltzmann machine (RBM). Finally, each PRI is represented by a feature vector containing 1024 elements. Based on the training set, we constructed the prediction model, DLPRI, using a convolutional neural network. DLPRI has 7 layers excluding the input. The input is a 32 x 32 matrix, and the sliding window (convolution kernel) is a 5 x 5 matrix. The first layer, C1, is a convolutional layer consisting of six feature maps of 28 x 28 nodes each. Weights are shared within each C1 feature map. The ReLU function is used as the activation function of the convolutional network, with the aim of keeping the feature maps invariant to displacement. The second layer, S2, has six feature maps of 14 x 14 nodes each. It is obtained by down-sampling: every four points of a C1 feature map are averaged and assigned to one point of S2. Each S2 feature map is therefore 1/4 the size of the corresponding C1 map, with half as many rows and half as many columns. The same operations are applied to the layers C3 and C4. The layers C5, C6 and C7 are one-dimensional fully connected layers.

Result: We tested the models using 10-fold cross-validation. The classification accuracy of DLPRI on the training set reaches 96.7%. On the test dataset, the sensitivity and specificity are 91.2% and 93.4%, respectively. The performance of DLPRI on the public datasets is better than or comparable to that of the traditional models.

Conclusion: In this paper, we constructed two prediction models for PRI, DLPRI and DLPRI_S, using deep learning methods. By combining sampling with convolution, the computing time is reduced significantly and the generalization ability and robustness of the models are improved. The two models have better sensitivity and specificity than the traditional methods. These results indicate that deep learning methods can improve the prediction accuracy of PRI. We also expect deep learning methods to find broad application in bioinformatics.
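
The layer sizes quoted in the Methods can be checked with a short sketch. The Python below is not the thesis code. It assumes a stride of 1 with no padding for the 5 x 5 convolution and non-overlapping 2 x 2 average pooling, which is what the stated 32 -> 28 -> 14 sizes imply. It also assumes that the 1024-element feature vector is reshaped row by row into the 32 x 32 input. The function names `conv2d_valid` and `avg_pool_2x2`, the toy feature values and the uniform kernel are all hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding), then apply ReLU."""
    n, k = len(image), len(kernel)
    out_size = n - k + 1  # 32 - 5 + 1 = 28 for the sizes in the abstract
    out = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            s = sum(image[r + i][c + j] * kernel[i][j]
                    for i in range(k) for j in range(k))
            row.append(max(0.0, s))  # ReLU activation
        out.append(row)
    return out


def avg_pool_2x2(fmap):
    """Average each non-overlapping 2 x 2 block into a single value."""
    n = len(fmap) // 2
    return [[(fmap[2 * r][2 * c] + fmap[2 * r][2 * c + 1]
              + fmap[2 * r + 1][2 * c] + fmap[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(n)]
            for r in range(n)]


# Toy stand-in for one PRI feature vector (1024 elements); values are made up.
features = [float(i % 7) for i in range(1024)]
# Assumed row-major reshape into the 32 x 32 input matrix.
image = [features[r * 32:(r + 1) * 32] for r in range(32)]
# One hypothetical 5 x 5 kernel; a trained C1 layer would hold six learned ones.
kernel = [[1.0 / 25] * 5 for _ in range(5)]

c1 = conv2d_valid(image, kernel)  # one C1 feature map
s2 = avg_pool_2x2(c1)             # the matching S2 feature map

print(len(c1), len(c1[0]))  # 28 28
print(len(s2), len(s2[0]))  # 14 14
```

Running the same two functions once per kernel would give the six C1 and six S2 feature maps described in the abstract.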
Keywords/Search Tags:Protein, RNA, Deep learning, CNN, RPI