
The mRNA Multi-feature Recognition Method Binding Secondary Structure From lncRNAs

Posted on: 2017-04-04
Degree: Master
Type: Thesis
Country: China
Candidate: Y Li
GTID: 2180330482489361
Subject: Computer application technology
Abstract/Summary:
Approximately 70% of the human genome is transcribed, yet only 1% to 2% of the resulting transcripts encode protein. Recent transcriptome research has shown that many transcripts classified as non-coding RNA nevertheless have distinct biological functions. Among non-coding RNAs, long non-coding RNA (lncRNA), which is longer than 200 nt, occupies the dominant position and is involved in many complex biological functions. Its indispensable role in regulating these functions has attracted growing attention. Because high-throughput sequencing is now widely applied, thousands of transcripts have been reconstructed, and effectively and accurately screening protein-coding RNA out of lncRNA has become one of the most challenging tasks.
This thesis aims to extract features that clearly separate mRNA from lncRNA by exploring the differences in the information hidden in them, and it uses an SVM classifier to distinguish mRNA from lncRNA. Starting from the RNA sequence and its secondary structure, the thesis extracts three categories of features: the ORF, protein sequence similarity, and structural thermodynamic free energy. Each category contains several detailed indexes. For the ORF category, the integrity rate, the coverage rate, and the ORF are taken together as one group of data. For protein sequence similarity, the number of matches against the protein pool is taken as one index, and the matches are also scored by mean E-value and by reading frame. The two secondary-structure indexes are the ratio of the normalized minimum free energy to the GC content and the ratio of the normalized minimum free energy to the number of base pairs. Each index shows an obvious difference between mRNA and lncRNA.
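The ORF and secondary-structure indexes described above can be illustrated with a minimal sketch. The abstract does not give exact formulas, so everything here is an assumption made for illustration: the function names are hypothetical, "coverage" is taken as longest-ORF length divided by transcript length, "integrity" as 1.0 when the ORF ends in an in-frame stop codon, and "normalized minimum free energy" as the MFE divided by sequence length. The MFE and base-pair count are assumed to come from an external folding tool and are passed in as plain numbers.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end, complete) for the longest ATG-initiated ORF
    on the forward strand; complete is True if a stop codon closes it."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA letters
    best = (0, 0, False)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3, True)
                start = None
        if start is not None:
            # ORF runs off the 3' end without a stop codon: incomplete
            end = start + ((len(seq) - start) // 3) * 3
            if end - start > best[1] - best[0]:
                best = (start, end, False)
    return best


def orf_features(seq):
    """Assumed ORF indexes: length, coverage, and integrity."""
    start, end, complete = longest_orf(seq)
    length = end - start
    return {
        "orf_length": length,
        "coverage": length / len(seq) if seq else 0.0,
        "integrity": 1.0 if complete else 0.0,
    }


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def structure_features(seq, mfe, n_base_pairs):
    """Assumed structure indexes: length-normalized MFE divided by GC
    content, and length-normalized MFE divided by base-pair count."""
    norm_mfe = mfe / len(seq)
    gc = gc_content(seq)
    return (
        norm_mfe / gc if gc else 0.0,
        norm_mfe / n_base_pairs if n_base_pairs else 0.0,
    )
```

For example, `orf_features("CCATGAAATAGCC")` finds the 9 nt ORF `ATGAAATAG`, and `structure_features(seq, mfe, n_base_pairs)` returns the two ratios as a pair. The resulting numbers would form part of the feature vector passed to the classifier.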
This thesis draws 1,800 records respectively from the NONCODE lncRNA database and the RefSeq mRNA database as the training data set. A support vector machine is combined with different groupings of the features to produce several RNA recognition methods, from which the best-performing one is identified. After training, two means of verification are used: cross-validation and validation on an mRNA data set. In the cross-validation, sensitivity, specificity, and accuracy are chosen as the three evaluation criteria. The results show that the dual-feature method combining secondary structure with protein similarity predicts better than the others. In the data set validation, the accuracy of the protein sequence similarity method is 84.9%, the accuracy of the mixed features is 63.4%, and the accuracy of protein sequence similarity combined with ORF is 59.8%. All three methods are more accurate than CPC.
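The three evaluation criteria named above have standard definitions, sketched below together with a simple k-fold split. This is an illustrative sketch only, not the thesis's code: the function names are hypothetical, mRNA is assumed to be the positive class (label 1), and the number of folds is an arbitrary choice because the abstract does not state it.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity, accuracy) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return sensitivity, specificity, accuracy


def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k interleaved folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]
```

In a cross-validation loop, a classifier would be trained on each `train` index set, used to predict the matching `test` set, and scored with `evaluate`; averaging the three values over the folds gives the cross-validated sensitivity, specificity, and accuracy.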
Keywords/Search Tags: Bioinformatics, lncRNA, mRNA, secondary structure, feature recognition