| Circular RNAs (circRNAs), a class of endogenous RNAs, are widespread in eukaryotic cells. They form covalently closed loops with no 5' caps or 3' polyadenylated tails, and are usually regarded as non-coding RNAs. According to classical molecular biology, precursor RNAs (pre-RNAs) are transcribed from the DNA template strand and processed into linear messenger RNA (mRNA) by canonical splicing, in which introns are removed and exons are joined in genomic order. CircRNAs, by contrast, are derived from precursor RNAs by back-splicing, in which the ends are joined head-to-tail by a covalent bond to form a closed loop. With the rapid development of high-throughput sequencing and bioinformatics tools, more and more circRNAs have been discovered; they can regulate the expression of disease-related genes and play vital roles in cell physiology. CircRNA research has therefore become a hot topic in transcriptomics. For a long time, circRNAs were regarded as non-coding RNAs that cannot encode proteins. Only recently have researchers confirmed that individual endogenous circRNAs can encode proteins, which opened up a new understanding of protein-coding circRNAs. Research has shown that about 10% of circRNAs can encode proteins, which means that a large number of protein-coding circRNAs remain undiscovered. Because experimental verification is tedious, time-consuming and costly, this study analyzes the protein-coding potential of circRNAs based on their sequence and structural characteristics, and develops a prediction tool to identify protein-coding circRNAs, helping researchers narrow the range of experimental candidates and find more protein-coding circRNAs. The main work of this paper is as follows. First, the sequence and structural characteristics of protein-coding circRNAs were analyzed. CircRNA sequences were extracted according to the reference
genome, and open reading frames (ORFs) were predicted from these sequences; a circRNA with an ORF has the ability to encode proteins. Because translation initiation of circRNAs depends on internal ribosome entry sites (IRESs), we used the existing tools IRESfinder and VIPS to analyze whether circRNAs contain IRESs. We also applied existing tools for predicting the protein-coding potential of transcripts to circRNAs, and then used phastCons and phyloP to analyze the sequence conservation of circRNAs. It has been shown that m6A modification sites in circRNAs can drive their translation initiation; on this basis, we applied the existing m6A prediction tool SRAMP to predict m6A modification sites in circRNAs. To further study the sequence and structural characteristics of protein-coding circRNAs, we calculated the GC content and start codon (AUG) content of circRNAs with an in-house Python program, and used an improved k-mer feature to calculate the frequency of special k-mers in circRNA sequences. Second, machine learning methods were used to predict the protein-coding potential of circRNAs. The positive samples of the training set were divided into four protein-coding circRNA datasets with different confidence levels, derived respectively from literature verification, mass spectrometry support, Ribo-seq support, and circRNAs annotated as having protein-coding potential in the circRNADb database. The negative samples were randomly selected from the circBase database, in the same number as the positive samples. We used all features of the training set to train logistic regression, support vector machine, random forest and XGBoost models, and then selected important features according to their feature-importance ranking to prevent over-fitting. Finally, we evaluated the performance of each classification model. The
classification performance of each model was evaluated by ten-fold cross-validation and on an independent test set; XGBoost performed best, followed by random forest and support vector machine. We use a combined model of XGBoost, random forest and support vector machine as the classifier for whether a circRNA can encode proteins. The average AUC of this classifier under ten-fold cross-validation reached 0.9369, with a prediction accuracy of 86.66%; the prediction accuracy on the independent test set was 73.53%, and 13 of the 17 literature-verified protein-coding circRNAs were predicted as positive. We integrated this method into a bioinformatics prediction tool, CircCAD. To demonstrate CircCAD, we took all circRNAs in circBase as input; the results show that about 14.9% of them have the ability to encode proteins. We performed GO enrichment analysis on the parental genes of the predicted protein-coding circRNAs from brain tissue. The most significantly enriched biological process was positive regulation of hydrolase activity, indicating that protein-coding circRNAs may regulate cellular activities through this pathway. |
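
The sequence-level features mentioned above (GC content, AUG content and k-mer frequencies) can be illustrated with a minimal sketch. This is not the program used in the study: the function names `gc_content`, `circular_kmer_freq` and `aug_content` are hypothetical, and counting k-mers across the back-splice junction (treating the sequence as a circle) is an assumption made here for illustration, since the "improved k-mer" feature is not specified in the text.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def circular_kmer_freq(seq: str, k: int) -> dict:
    """Relative k-mer frequencies on a circular sequence.

    Assumption (not from the source): the last k-1 windows wrap around
    from the 3' end to the 5' end, so k-mers spanning the back-splice
    junction are counted. DNA input is converted to RNA (T -> U).
    """
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    if not 0 < k <= n:
        return {}
    extended = seq + seq[: k - 1]  # append the first k-1 bases to close the circle
    counts = Counter(extended[i:i + k] for i in range(n))  # one window per start position
    return {kmer: count / n for kmer, count in counts.items()}


def aug_content(seq: str) -> float:
    """Frequency of the start codon AUG among all circular 3-mers."""
    return circular_kmer_freq(seq, 3).get("AUG", 0.0)


if __name__ == "__main__":
    example = "UGCCA"  # toy sequence; AUG only appears across the junction
    print(gc_content(example))
    print(circular_kmer_freq(example, 3))
    print(aug_content(example))
```

In the toy sequence `UGCCA`, a linear scan finds no start codon, whereas the circular scan finds `AUG` spanning the junction. This is why junction-aware counting may matter for circRNA features.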