Identifying Splicing Sites Of Circular RNA Based On Deep Learning

Posted on: 2022-08-03
Degree: Master
Type: Thesis
Country: China
Candidate: K Sun
GTID: 2480306575969679
Subject: Computer Science and Technology
Abstract/Summary:
Organisms contain many kinds of RNA with different functions. Among them, noncoding RNA plays an important regulatory role, and circular RNA (circRNA) is the most representative type of noncoding RNA. Given the important regulatory role and great potential medical value of circRNA, identifying it accurately is of great significance. This thesis combines a convolutional neural network (CNN) with a long short-term memory network (LSTM) to design a circRNA splice site recognition model, DeepCircRNA, and provides an online recognition tool based on this model.

As circRNA sequencing data accumulate, more and more circRNA prediction software has appeared, greatly improving the accuracy and efficiency of circRNA discovery. However, these tools depend on sequencing data, that is, RNA-seq technology, and use alignment methods or machine learning algorithms for identification and prediction. This approach has several disadvantages: the experimental cost is high, the experimental cycle is long, the accuracy is limited, and negative data sets are scarce and difficult to obtain.

Based on this research status, this thesis uses CNN, which is strong at spatial feature extraction, and LSTM, which is well suited to biological sequence data, to design a circular RNA splice site recognition model for human and Arabidopsis. The model works from the perspective of splice sites on genomic data, that is, DNA sequences. It overcomes the disadvantages above: the data can be downloaded directly, no experiment or sequencing cost is incurred, and the accuracy is high.

The research work and innovations of this thesis fall into four parts:

(1) An appropriate length of splice site region is selected for human and Arabidopsis, and each base sequence is processed into a two-dimensional vector that serves as the model input.

(2) Sequence recognition problems usually feed the sequence vector directly into the model. This thesis adds a second input, GC content, to the usual single-input model to improve its performance.

(3) Recognition research on DNA, RNA, or protein sequences usually adopts one customary coding method for biological sequences. Few studies have discussed how different coding methods affect model performance in circRNA recognition. This thesis tests, analyzes, and discusses the impact of eight different coding methods on the model.

(4) An online circRNA recognition tool is designed and built for researchers in related fields. It can identify a single DNA sequence, accept uploaded files to identify sequences in batches, and provide simple visualization.

After model training, testing, and comparison, the results show that the accuracy of the DeepCircRNA model reaches about 0.95 on the human test data set and about 0.92 on the Arabidopsis test data set. Among the machine learning algorithms compared, DeepCircRNA performs best.
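Contributions (1) and (2) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not name the eight coding methods it compares, so one-hot encoding is assumed here purely for illustration, and the function names `one_hot_encode` and `gc_content` as well as the example window are hypothetical.

```python
# Illustrative sketch only. One-hot is an ASSUMED coding method; the
# abstract does not say which eight coding methods the thesis compares.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot_encode(seq):
    """Encode a DNA string as a len(seq) x 4 list of rows.

    Bases outside A/C/G/T (for example N) become an all-zero row.
    """
    return [list(ONE_HOT.get(base, (0, 0, 0, 0))) for base in seq.upper()]


def gc_content(seq):
    """Return the fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical splice-site window; real windows would be much longer.
window = "ACGTGGCA"
matrix = one_hot_encode(window)   # first model input: two-dimensional encoding
gc = gc_content(window)           # second model input: a single scalar
print(len(matrix), len(matrix[0]), gc)  # prints: 8 4 0.625
```

In a two-input design of this kind, the encoded matrix would go to the CNN and LSTM layers, while the scalar GC content would be joined to the learned sequence features before the final classification layer. How the thesis actually merges the two inputs is not stated in the abstract.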
Keywords/Search Tags:circRNA, splicing sites, convolutional neural networks, long short-term memory networks, GC content