| RNA interference (RNAi) is a form of post-transcriptional gene silencing. In an energy-consuming process, double-stranded RNA molecules are degraded into 21-23 nt small interfering RNAs (siRNAs), which inhibits the expression of homologous mRNA and plays an important role in regulating gene expression in various eukaryotes. This thesis investigates the problem of identifying small interfering RNAs and constructs effective prediction models, mainly using ensemble and deep learning algorithms. The specific research content is as follows: (1) To address the low recognition accuracy of virus-derived small interfering RNAs (vsiRNAs) in plants, this thesis uses deep learning and ensemble learning methods to construct a prediction model. We built a multi-layer framework based on a convolutional neural network, a multi-scale residual network, and a bidirectional Long Short-Term Memory (LSTM) network with self-attention to learn sequence information from word2vec and fastText encodings. After selecting the optimal combination of parameters, we retained the five models with the highest sensitivity. Then, after comparing different integration strategies, we adopted logistic regression to integrate these five models into the final predictor, named COPPER. To further demonstrate the generalization ability of COPPER, we compared it with PVsiRNAPred on an independent dataset and evaluated the effect of homology on its performance using siRNA sequences with different similarity levels. Moreover, an ablation study was conducted to determine the importance of each component of the model. (2) Phased small interfering RNAs (phasiRNAs) are plant secondary small interfering RNAs that are typically generated by the convergence of miRNAs and polyadenylated mRNAs. We propose a deep learning predictor called DIGITAL for predicting miRNA-triggered phasiRNA loci. First, positive data were collected from the TarDB database; 5408 miRNA-triggered 21-nt phasiRNA entries and 443 miRNA-triggered
24-nt phasiRNA entries were obtained. Negative data were generated by randomly replacing a certain number of nucleotides in the positive samples. Then, the sequences were one-hot encoded and used to train a deep learning model based on a multi-scale residual network and a bidirectional LSTM. Bayesian optimization was used to tune the key hyperparameters according to accuracy (ACC) and to select the best configuration. In addition, six traditional classifiers were built for comparison with DIGITAL: support vector machine, naive Bayes, k-nearest neighbors, extreme gradient boosting (XGBoost), logistic regression, and random forest. The results demonstrated the effectiveness of our model. |
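
The negative-sample generation and one-hot encoding steps described for DIGITAL can be illustrated with a minimal sketch. This is not the thesis code: the function names `make_negative` and `one_hot`, the `ACGU` alphabet, the example sequence, and the choice of five substitutions are all assumptions made for illustration, since the abstract only says that "a certain number" of nucleotides were replaced.

```python
import random

ALPHABET = "ACGU"  # assumed RNA alphabet; the abstract does not state it


def make_negative(seq, n_subs, rng):
    """Return a copy of seq with n_subs distinct positions changed to a different base."""
    if n_subs > len(seq):
        raise ValueError("n_subs cannot exceed the sequence length")
    bases = list(seq)
    for pos in rng.sample(range(len(bases)), n_subs):
        # Choose any base other than the current one, so the position really changes.
        bases[pos] = rng.choice([b for b in ALPHABET if b != bases[pos]])
    return "".join(bases)


def one_hot(seq):
    """Encode seq as one 4-element 0/1 row per nucleotide (all zeros for unknown symbols)."""
    return [[1 if base == ref else 0 for ref in ALPHABET] for base in seq]


rng = random.Random(0)               # fixed seed, only to make the sketch reproducible
positive = "AUGCUAGCUAGGCUAACGUAG"   # arbitrary illustrative 21-nt sequence, not from TarDB
negative = make_negative(positive, n_subs=5, rng=rng)  # n_subs=5 is a hypothetical value
encoded = one_hot(negative)          # 21 rows x 4 columns, ready to feed a sequence model
```

Sampling positions without replacement guarantees that the negative sequence differs from the positive one at exactly `n_subs` positions, which keeps the degree of perturbation controlled.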