RNA-Protein Interaction Prediction Based on Data Augmentation

Posted on: 2022-07-09  Degree: Master  Type: Thesis
Country: China  Candidate: J R Yan
GTID: 2480306551470634  Subject: Master of Engineering
Abstract/Summary:
RNA-binding proteins (RBPs) are ubiquitous proteins that bind to RNA and thereby play important roles in biological functions such as mRNA localization and the mediation of RNA localization and translation. Studies have shown that most RNAs must bind to proteins in order to carry out their functions. The binding of an RBP to a specific RNA is called an RNA-protein interaction; that is, the RNA carries binding sites for the protein. RNA-protein interaction is currently an important research topic in biomedical engineering, and improving its prediction accuracy with computational methods has become a focus of research. Traditional machine learning methods struggle to extract features because the interaction mechanism and the binding features are poorly understood, and no generally accepted feature extraction method has yet emerged. Most research in this field has therefore turned to deep learning.

However, binding site prediction based on deep learning still has the following shortcomings. First, most deep learning methods consider only sequence information and ignore secondary structure information. Second, the RBPs with lower prediction accuracy generally have small datasets; because deep learning models require large amounts of data, prediction for these RBPs is hard to improve. Third, parameters are initialized randomly during training, which limits the gain in prediction performance. Finally, most methods consider only whether a binding site exists and ignore the sequence specificity of binding.

In view of the above, the research content and results of this thesis are as follows.

First, RNA sequence and structure information were constructed. Secondary structure information was added to the RNA sequence information, and both sequence and structure were one-hot encoded, so that each is represented as a numerical tensor.

Second, a generative adversarial network (GAN) was presented to augment the dataset. Eight RBPs whose prediction accuracy was below the average AUC were selected from 24 RBPs, and a GAN was built to enlarge the datasets of these eight RBPs. The generator and the discriminator were trained alternately; after training, the generator produced high-quality synthetic data. The average AUC of the eight RBPs was compared before and after data augmentation and was found to increase after augmentation. This also verifies that data augmentation can improve the learning and prediction ability of the prediction model.

Third, a convolutional autoencoder (CAE) based on sequence and structure features was proposed. In pre-training, a CAE was trained on the sequence data and on the structure data separately, in an unsupervised way. In fine-tuning, the trained sequence encoder and structure encoder were concatenated, and two LSTM layers were added to capture the long-term dependencies of sequence motifs and structural motifs. A motif is a common short subsequence or substructure in RNA that can bind to an RBP; here it can be regarded as a feature of the RNA. The model was trained on the 24 RBPs. Compared with existing studies, the average AUC improved to some extent.

Finally, sequence motifs and structure motifs were extracted. The parameters of a convolution kernel record the weight of each base position. The trained kernels of the first convolutional layer in the sequence encoder and in the structure encoder were convolved with the sequence and the structure, respectively, giving one value at every site. Where a value exceeded the threshold, the corresponding subsequence was extracted. All of the extracted short sequences were then analyzed to obtain the common motif. The results showed that the extracted motifs were consistent with experimentally verified motifs. In addition, the model can extract unknown motifs from public datasets, providing a basis for further exploration of binding characteristics.

Experiments were designed on the basis of the above research and validated on public datasets. Compared with existing studies, the proposed method improves prediction, achieving an average AUC of 0.939.

Further, to demonstrate the effectiveness of the model, a Web system for protein-RNA interaction prediction based on data augmentation was designed and implemented. First, the system receives as input an arbitrary RNA sequence whose binding sites are unknown and predicts its secondary structure; the sequence and the secondary structure are each one-hot encoded. Second, the trained model is called to calculate the binding probability between the RNA sequence and the RBPs, and the binding probability is returned to the view. Finally, the Web system provides an analysis platform for researchers and reports the binding preferences of RBPs, which can inform further exploration of the binding mechanism between RNA and protein.
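
The one-hot encoding and the threshold-based motif extraction described above can be illustrated with a minimal sketch. It is not the thesis code: the function names, the `ACGU` alphabet ordering, the hand-written 3-position kernel, and the threshold value are all assumptions made for illustration. In the thesis, the kernels come from the first convolutional layer of the trained encoders, and the structure alphabet is not specified in this abstract (dot-bracket symbols could be passed through the `alphabet` argument, for example).

```python
from typing import List

BASES = "ACGU"  # assumed alphabet and column order; not specified in the abstract


def one_hot(seq: str, alphabet: str = BASES) -> List[List[float]]:
    """Encode each symbol as a one-hot row; unknown symbols become all-zero rows."""
    return [[1.0 if symbol == a else 0.0 for a in alphabet] for symbol in seq.upper()]


def scan(encoded: List[List[float]], kernel: List[List[float]]) -> List[float]:
    """Slide the kernel along the encoded sequence (stride 1, no padding).

    Returns one score per window start position, i.e. one value per site.
    """
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + i][j] * kernel[i][j]
            for i in range(width)
            for j in range(len(kernel[i]))
        )
        scores.append(score)
    return scores


def extract_hits(seq: str, kernel: List[List[float]], threshold: float,
                 alphabet: str = BASES) -> List[str]:
    """Return the subsequences whose kernel score exceeds the threshold."""
    width = len(kernel)
    scores = scan(one_hot(seq, alphabet), kernel)
    return [seq[i:i + width] for i, s in enumerate(scores) if s > threshold]


# Hypothetical kernel of width 3 that responds most strongly to "UGU".
# Rows are kernel positions; columns follow the order of BASES (A, C, G, U).
TOY_KERNEL = [
    [0.0, 0.0, 0.0, 1.0],  # position 0 favors U
    [0.0, 0.0, 1.0, 0.0],  # position 1 favors G
    [0.0, 0.0, 0.0, 1.0],  # position 2 favors U
]

hits = extract_hits("ACUGUAUGUC", TOY_KERNEL, threshold=2.5)
print(hits)  # ['UGU', 'UGU']
```

Collecting such above-threshold subsequences across many input RNAs and aligning them is one way the common motif mentioned above could then be summarized; that aggregation step is not shown here.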
Keywords/Search Tags: RNA-Binding Protein, Generative Adversarial Network, Data Augmentation, Convolutional Autoencoder, Long Short-Term Memory