A Long Non-coding RNA Prediction Model Based On Multi-feature Fusion

Posted on:2019-07-19

Degree:Master

Type:Thesis

Country:China

Candidate:D L Jiang

Full Text:PDF

GTID:2370330563958564

Subject:Computer technology

Abstract/Summary:

Only a very small part of an organism's genome encodes proteins. Most transcripts are non-coding RNAs, which cannot encode proteins, and those longer than 200 nucleotides are defined as long non-coding RNAs (lncRNAs). A growing number of studies show that lncRNAs perform important and wide-ranging biological functions and are significant for maintaining the activities of organisms. lncRNAs are specifically expressed, and their number far exceeds what has been annotated. At the same time, with the advance of next-generation sequencing technology, genes in different organisms have been sequenced, providing ample lncRNA candidates. Identifying and characterizing new lncRNAs from RNA sequences with machine learning methods therefore has important biological significance.

In this paper, three kinds of features are extracted from RNA sequences: sequence features, secondary structure features, and functional features. Sequence features comprise k-mer features and ORF features. Secondary structure features are represented by secondary structure triples. Functional features comprise pseudo-nucleotide features based on the physicochemical properties of dinucleotides and the minimum free energy of secondary structure folding. To address the imbalance between positive and negative samples, an improved K-means clustering method is used. Grid search is applied to find the best parameters for pseudo-nucleotide feature extraction.

To remove redundant features and find the most relevant feature set, the paper proposes an integrated (ensemble) feature selection method based on maximum relevance and minimum redundancy. Feature selection methods such as information gain, the Pearson correlation coefficient, the Relief algorithm, and random forest serve as the maximum-relevance evaluation indices, while the Pearson correlation coefficient serves as the minimum-redundancy evaluation index. Finally, an SVM classification model is constructed because it has significant advantages in solving nonlinear problems.

Experimental results on Arabidopsis sequence datasets show that the proposed integrated feature selection method selects fewer features and builds a classification model with good classification performance, and that it is more effective than the CPC, CPAT, and Lncrna-pred methods.
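
As a rough illustration of the sequence features named in the abstract (k-mer features and ORF features), the following minimal Python sketch computes k-mer frequencies and the length of the longest open reading frame. The abstract does not give the thesis's exact definitions, so the choices here are assumptions: k = 3 by default, an ORF defined as AUG through the first in-frame stop codon on the given strand, and ORF coverage as an extra feature. The function names are hypothetical.

```python
from itertools import product

ALPHABET = "ACGU"
STOP_CODONS = {"UAA", "UAG", "UGA"}


def normalize(seq):
    """Upper-case the sequence and map DNA-style T to RNA-style U."""
    return seq.upper().replace("T", "U")


def kmer_frequencies(seq, k=3):
    """Relative frequency of every k-mer over ACGU, in lexicographic order."""
    seq = normalize(seq)
    counts = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows with ambiguous bases such as N
            counts[kmer] += 1
            total += 1
    return {kmer: (c / total if total else 0.0) for kmer, c in counts.items()}


def longest_orf_length(seq):
    """Length (nt, stop codon included) of the longest AUG...stop ORF.

    Assumption: only the three forward reading frames are scanned.
    """
    seq = normalize(seq)
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "AUG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def sequence_features(seq, k=3):
    """Feature vector: 4**k k-mer frequencies, ORF length, ORF coverage."""
    seq = normalize(seq)
    orf_len = longest_orf_length(seq)
    coverage = orf_len / len(seq) if seq else 0.0
    return list(kmer_frequencies(seq, k).values()) + [float(orf_len), coverage]
```

Calling `sequence_features(seq)` on each transcript gives one fixed-length row per sequence, which could then be passed to a feature selection step and a classifier such as an SVM. The secondary structure and pseudo-nucleotide features described in the abstract are not covered by this sketch.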

Keywords/Search Tags:

Feature extraction, Ensemble Feature Selection, Non-coding RNA Recognition

Related items

1	Research Of Emotion Recognition Based On Multi-domain EEG Features And Integration Of Feature Selection
2	Study On The Methods Of Feature Extraction And Recognition Of Ships In SAR Imagery
3	Gesture Recognition Based On SEMG Signal
4	Feature Extraction And Selection Research Of Sub-health Recognition Based On Pulse Wave
5	Research On Feature Recognition Method Of Weak Emission Line Of LAMOST Low-resolution Spectra
6	Prediction Of Non-coding RNA Based On Feature Selection And Integration Algorithms
7	Hail Recognition Index Design Based On Feature Extraction Of Radar Images
8	Research On Emotional Feature Extraction And Classification Of EEG
9	Research On Feature Engineering And Feature Selection Algorithm Of Biogenetic Data Based On CNN
10	Prediction Of Amidation Sites Based On Ensemble Learning