
A Long Non-coding RNA Prediction Model Based On Multi-feature Fusion

Posted on: 2019-07-19
Degree: Master
Type: Thesis
Country: China
Candidate: D L Jiang
Full Text: PDF
GTID: 2370330563958564
Subject: Computer technology
Abstract/Summary:
Only a very small part of an organism's genome encodes proteins. Most transcripts are non-coding RNAs, which cannot encode proteins, and those longer than 200 nucleotides are defined as long non-coding RNAs (lncRNAs). A growing number of studies show that lncRNAs carry out important and wide-ranging biological functions and are significant in maintaining the activities of organisms. lncRNAs are specifically expressed, and their number far exceeds what has been annotated. At the same time, with the advancement of next-generation sequencing technology, genes in different organisms have been sequenced, providing sufficient lncRNA candidates. Identifying and describing new lncRNAs from RNA sequences through machine learning methods therefore has important biological significance.

In this paper, three kinds of characteristics of RNA sequences are extracted: sequence features, secondary structure features and functional features. Sequence features include k-mer features and ORF features, and secondary structure features are represented by secondary structure triplets. Functional features include pseudo-nucleotide features based on the physicochemical properties of dinucleotides and the minimum free energy of secondary structure folding. To address the imbalance between positive and negative samples, an improved K-means clustering method is used. Grid search is applied to find the best parameters for pseudo-nucleotide feature extraction. To remove redundant features and find the most relevant feature set, the paper proposes an integrated feature selection method based on maximum relevance and minimum redundancy. Information gain, the Pearson correlation coefficient, the Relief algorithm and random forest serve as the maximum-relevance evaluation criteria, while the Pearson correlation coefficient serves as the minimum-redundancy criterion. Finally, an SVM classification model is constructed because it has significant advantages in solving nonlinear problems.

The experimental results on Arabidopsis sequence datasets show that the integrated feature selection method proposed in this paper selects fewer features while building a classification model with good classification performance, and that it is more effective than the CPC, CPAT and Lncrna-pred methods.
Keywords/Search Tags:Feature extraction, Ensemble Feature Selection, Non-coding RNA Recognition
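
As an illustration of the sequence features named in the abstract, the following is a minimal sketch of how k-mer frequencies and a simple ORF feature could be computed from a single RNA sequence. It is not the thesis's implementation: the function names (kmer_frequencies, longest_orf_length, sequence_features), the choice of k = 3, the restriction to the three forward reading frames, and the use of ORF length and ORF coverage as the ORF features are all assumptions made for this example.

```python
from itertools import product

STOP_CODONS = {"UAA", "UAG", "UGA"}


def kmer_frequencies(seq, k=3):
    """Normalized k-mer frequencies over the ACGU alphabet, in a fixed order."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def longest_orf_length(seq):
    """Length (nt) of the longest AUG...stop ORF in the three forward frames."""
    seq = seq.upper().replace("T", "U")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "AUG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)  # include the stop codon
                start = None
    return best


def sequence_features(seq, k=3):
    """Concatenate k-mer frequencies with ORF length and ORF coverage."""
    orf_len = longest_orf_length(seq)
    coverage = orf_len / len(seq) if seq else 0.0
    return kmer_frequencies(seq, k) + [float(orf_len), coverage]
```

With k = 3 this yields a vector of 64 k-mer frequencies plus two ORF values per sequence. Vectors of this kind could then be combined with structural and functional features before feature selection and SVM training, as the abstract describes.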