Research On Identification Of Non-coding RNA Based On Stacking Model

Posted on: 2020-01-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y Yang
Full Text: PDF
GTID: 2370330590474440
Subject: Computer Science and Technology
Abstract/Summary:
With advances in next-generation high-throughput sequencing and large-scale comparative sequencing, a large number of transcripts have become available, and distinguishing coding from non-coding RNA has become a core task in transcript analysis. There are two trends in non-coding RNA identification. One is to serve the large number of non-model organisms, which requires species-neutral identification tools; the other is to design dedicated tools for specific species. To address both needs, this thesis designs a non-coding RNA identification framework consisting of two modules. The feature extraction module extracts features from transcript sequences at three levels: DNA, RNA and peptide. At the DNA and RNA levels, 17 features shown to be effective in previous research are collected. At the peptide level, 8 features covering the physicochemical properties and secondary structure of the protein are newly selected. The classifier module is a two-layer classifier based on the stacking strategy, which combines the machine learning models RF, XGBoost and LightGBM and applies them to non-coding RNA identification.

The framework is implemented in Python, and two models are built on it for the two requirements: a cross-species non-coding RNA identification model and a plant non-coding RNA identification model. The cross-species model is a species-neutral tool and performs well on a multi-species test set consisting of human, mouse, zebrafish, fruit fly, worm and Arabidopsis thaliana, reaching a classification accuracy of 97.23%. In addition, specific models were trained for plant species and tested on Arabidopsis thaliana and Zea mays data from Ensembl Plants, which demonstrated the applicability of the models to plants. During testing, the importance of the 25 selected features was analyzed, and five features were found to be the most important: Hexamer score, ORF length, Mw (molecular weight) value, ratio of pI (isoelectric point) to Mw, and Turn value. Of these, Mw value, ratio of pI to Mw, and Turn value are all peptide-level features.
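
To make the feature extraction step more concrete, the following is a minimal sketch of how one of the top-ranked features, ORF length, and the peptide used for the peptide-level features might be derived from a transcript. It is not the thesis's implementation. The function names (`longest_orf`, `translate`, `transcript_features`) are hypothetical, and the ORF definition is an assumption: forward strand only, from the first ATG to the next in-frame stop codon, with the stop codon counted in the length.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def longest_orf(seq: str) -> str:
    """Return the longest ATG..stop ORF in the three forward frames ('' if none).

    Assumption: ORFs without an in-frame stop codon are ignored.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open an ORF at the first start codon
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]  # stop codon included
                if len(orf) > len(best):
                    best = orf
                start = None  # look for the next ORF in this frame
    return best


def translate(orf: str) -> str:
    """Translate an ORF into a peptide, stopping at the first stop codon."""
    peptide = []
    for i in range(0, len(orf) - 2, 3):
        aa = CODON_TABLE.get(orf[i:i + 3], "X")  # 'X' for ambiguous codons
        if aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)


def transcript_features(seq: str) -> dict:
    """Collect a few illustrative features for one transcript."""
    orf = longest_orf(seq)
    return {
        "orf_length": len(orf),
        "orf_coverage": len(orf) / len(seq) if seq else 0.0,
        "peptide": translate(orf),
    }
```

For example, `transcript_features("CCATGGCTTAAGG")` finds the ORF `ATGGCTTAA` in the third frame, giving an `orf_length` of 9 and the peptide `MA`. In a full pipeline the peptide would then feed the peptide-level features such as Mw and pI, and the resulting feature vector would be passed to the stacked classifier.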
Keywords/Search Tags: non-coding RNA identification, protein feature, stacking strategy, RF, XGBoost, LightGBM