Research On Characteristic Analysis Of Genome Words Constitution And Its Applications

Posted on: 2013-04-01
Degree: Master
Type: Thesis
Country: China
Candidate: J Song
Full Text: PDF
GTID: 2250330392968910
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, bioinformatics has focused on the characteristics of DNA sequence data. It explores the functional regions of DNA sequences using base information, mines potential sites that may contain functional elements, and studies the genomic information carried by the bases. This thesis aims to mine functional words and signals in DNA sequences. Because the non-coding regions carry little information, it is difficult to interpret DNA sequences accurately.

The thesis proposes a sequence segmentation tool based on conditional random fields, which avoids the label bias problem and allows features to be added freely. First, English word sequences are analyzed and language-independent features are selected; the improved entropy is found to carry the most information. The resulting accuracy in segmenting English sequences is above 90%. DNA sequences and English sequences share a small character set and language-independent features. With transfer learning, the eigen-values of the DNA and English sequences are analyzed, and it is found that the eigen-values of the two samples can be linked by a transition function. The eigen-values of the English sequences, transformed by the transition function, are mapped to the sample space of the DNA sequences. Then, without transfer learning, the sequences are segmented using only the existing information. Comparing the two sets of results, the recall with transfer learning is about 80%, whereas without it the recall is only about 40%. This shows that transfer learning improves segmentation accuracy.

Finally, the thesis presents applications of the word sequences. The species similarity between human and orangutan computed with the improved sequence alignment is closer to the true value than that computed with the vector space machine. Human, orangutan, and Arabidopsis are then taken as a set to calculate similarity, and the position of human is found to agree with its true position in the evolutionary tree. This shows that segmenting sequences into word units makes bioinformatics problems much easier to solve than before.
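
As a rough illustration of what a language-independent entropy feature for segmentation can look like, the sketch below computes plain branching entropy: for each n-character context, the Shannon entropy of the character that follows it. This is an assumption made for illustration only. The abstract does not define its "improved entropy", and the function name `branching_entropy`, the context length `n`, and the toy sequence are all hypothetical rather than taken from the thesis.

```python
import math
from collections import Counter, defaultdict


def branching_entropy(text, n=2):
    """Map each n-character context to the Shannon entropy (in bits)
    of the character that follows it in `text`."""
    followers = defaultdict(Counter)
    # Count which character follows every n-gram in the sequence.
    for i in range(len(text) - n):
        followers[text[i:i + n]][text[i + n]] += 1

    entropy = {}
    for context, counts in followers.items():
        total = sum(counts.values())
        entropy[context] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


# Toy unspaced sequence (hypothetical data, not from the thesis).
sequence = "thecatsatthedogsattheratsat"
scores = branching_entropy(sequence, n=2)

# Print contexts from most to least uncertain about the next character.
for context, h in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{context!r}: {h:.3f}")
```

In this toy sequence the context "th" is always followed by "e", so its entropy is zero, while "he" is followed by three different characters, which hints at a word boundary after "the". Because the computation uses only character counts, the same function runs unchanged on a sequence over the DNA alphabet; in a conditional random field such a score would be one feature among several.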
Keywords/Search Tags: conditional random fields, improved entropy, transfer learning, species similarity, evolutionary tree