Research On RNA Related Function Sites Based On Machine Learning

Posted on: 2021-05-29
Degree: Master
Type: Thesis
Country: China
Candidate: Y Bi
Full Text: PDF
GTID: 2370330602989012
Subject: Computational Mathematics
Abstract/Summary:
RNA plays an important biological role in gene encoding, decoding, regulation, and expression. In this thesis, we use traditional machine learning and deep learning algorithms to study several important RNA-related functional sites: the prediction of binding sites between circular RNA and RNA-binding proteins (RBPs), the identification of RNA pseudouridine sites, which can alter complementary base pairing, the prediction of N7-methylguanosine (m7G) sites, and the classification of the CPR family of insect cuticular proteins, which is related to RNA interference. The specific research content is as follows:

(1) Unlike traditional linear RNA (which has 5' and 3' ends), circular RNA (circRNA) is a special type of RNA that has a closed ring structure. To better understand the regulatory function of circRNA, it is necessary to gain insight into the interaction mechanism between circRNA and RBPs. We propose an ensemble neural network, termed PASSION, to predict RBP binding sites on circRNA. It concatenates a standard neural network with a hybrid deep neural network. Specifically, the input of the standard neural network is the optimal feature subset for each RBP, selected from six types of feature encoding schemes through incremental feature selection and the XGBoost algorithm. The input of the hybrid deep neural network (a convolutional neural network combined with a long short-term memory network) is a stacked codon-based encoding scheme. The results of 37 groups of benchmark experiments show that PASSION is highly competitive in recognizing binding sites between circRNA and RBPs.

(2) In view of the specificity of pseudouridine (Ψ) modification, we propose an ensemble approach, named EnsemPseU, to identify pseudouridine sites. First, five sequence-encoding strategies, namely kmer, binary encoding, enhanced nucleic acid composition, nucleotide chemical property, and nucleotide density, were used to extract features. Then, chi-square feature selection was applied to reduce the feature dimensionality and remove redundant information. Finally, an ensemble model integrating support vector machine, XGBoost, naive Bayes, k-nearest neighbor, and random forest classifiers was used to build our prediction model. Upon evaluation via 10-fold cross-validation and an independent test, our proposed model EnsemPseU outperformed the other existing model.

(3) N7-methylguanosine (m7G) is a positively charged mRNA modification that is essential for efficient gene expression and cell viability. Bioinformatics tools can be applied as an auxiliary method to identify m7G sites in transcriptomes. In this study, we develop a novel predictor called XG-m7G to identify m7G sites, based on the XGBoost classification algorithm and six different types of feature encodings. Moreover, by using the SHAP algorithm, this framework also highlights the most important features for identifying m7G sites.

(4) A prediction model based on a convolutional neural network is constructed using amino acid composition and amino acid pair composition features to identify CPR family sequences of insect cuticular proteins.
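As an illustration of the sequence-encoding step described in (2), the following is a minimal Python sketch of four of the named encodings (kmer, binary encoding, nucleotide chemical property, and nucleotide density). It is not the thesis implementation. The function names are hypothetical, the choice k=2 is arbitrary, and the chemical-property table follows a convention commonly used in the literature (ring structure, functional group, hydrogen bonding), which is assumed here rather than taken from the thesis.

```python
from itertools import product

ALPHABET = "ACGU"  # assumed RNA alphabet; other symbols are not handled

# Assumed convention: (ring structure, functional group, hydrogen bonding).
NCP_TABLE = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "U": (0, 0, 1)}


def kmer_frequencies(seq, k=2):
    """Relative frequency of every possible k-mer, in lexicographic ACGU order."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1  # number of sliding windows
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return [counts[m] / total for m in kmers]


def binary_encoding(seq):
    """One-hot encoding: four values per position, flattened."""
    out = []
    for nt in seq:
        out.extend(1 if nt == ref else 0 for ref in ALPHABET)
    return out


def chemical_property(seq):
    """Three chemical-property bits per position, flattened."""
    out = []
    for nt in seq:
        out.extend(NCP_TABLE[nt])
    return out


def nucleotide_density(seq):
    """For position i (1-based): occurrences of seq[i] in seq[:i], divided by i."""
    seen = {}
    out = []
    for i, nt in enumerate(seq, start=1):
        seen[nt] = seen.get(nt, 0) + 1
        out.append(seen[nt] / i)
    return out


def encode(seq, k=2):
    """Concatenate the four encodings into one feature vector."""
    return (kmer_frequencies(seq, k) + binary_encoding(seq)
            + chemical_property(seq) + nucleotide_density(seq))


if __name__ == "__main__":
    features = encode("AUAU")
    print(len(features), features[:4])
```

A vector built this way would then pass through feature selection (chi-square in the abstract) before reaching the classifiers. Enhanced nucleic acid composition is omitted from the sketch for brevity.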
Keywords/Search Tags: Site Prediction, RNA Binding Protein Sites, RNA Modification, Deep Learning, Machine Learning