DNA Sequence Feature Extraction Based On Statistical Feature

Posted on: 2012-04-02
Degree: Master
Type: Thesis
Country: China
Candidate: Q G Hu
GTID: 2230330371463507
Subject: Computer technology
Abstract/Summary:
With the completion of the Human Genome Project (HGP) and the ongoing study of gene sequences from many species, a large number of genome sequences have become available. These sequences contain a wealth of information and encode complex biological knowledge. Mining valuable information from such massive data is both a challenge and an opportunity for scientists. Feature extraction from DNA sequences is important for understanding the structure and function of genomes. Genomic sequence feature extraction means using methods from mathematics and information science to derive, from complex genome sequences, features that capture their essential characteristics. This paper presents two sequence feature extraction methods, both based on statistical features. Neither method requires sequence alignment, and both capture more sequence information than traditional methods while keeping time complexity low.

The first method combines the segmental probabilities of six nucleotide correlation factors with the four components of the conventional nucleotide composition, all obtained by counting the occurrence probabilities of nucleotides. It captures more sequence effects than the conventional 4-D nucleotide composition. To keep the calculation simple, the DNA sequence is processed in segments, which reduces the time complexity. The segment value can be chosen arbitrarily and does not affect the results.

The second method is based on information theory. It uses information entropy and mutual information, taking the probabilities of single bases and of dinucleotides as the event probabilities. Mutual information is computed for each of the 16 ordered pairs formed from the four bases, so a DNA sequence can be characterized by 16 mutual information values.
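As a rough illustration of the information-theoretic feature described above, the following Python sketch computes 16 values, one per ordered pair of bases. The abstract does not give the exact formula, so this assumes the pointwise form log2(p(xy) / (p(x) p(y))), where p(xy) is the dinucleotide frequency and p(x), p(y) are single-base frequencies. The function name `mutual_information_features` and the convention of returning 0 for unobserved dinucleotides are also assumptions made for illustration, not details taken from the thesis.

```python
import math
from collections import Counter

BASES = "ACGT"


def mutual_information_features(seq):
    """Return 16 values, one per ordered base pair (AA, AC, ..., TT).

    Assumed formula (not confirmed by the source):
        I(x, y) = log2( p(xy) / (p(x) * p(y)) )
    """
    seq = seq.upper()

    # Count single bases, ignoring ambiguous symbols such as N.
    base_counts = Counter(b for b in seq if b in BASES)

    # Count overlapping dinucleotides made only of unambiguous bases.
    pair_counts = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )

    n_bases = sum(base_counts.values())
    n_pairs = sum(pair_counts.values())
    if n_pairs == 0:
        raise ValueError("sequence contains no valid dinucleotide")

    features = []
    for x in BASES:
        for y in BASES:
            p_xy = pair_counts[x + y] / n_pairs
            if p_xy == 0:
                # Assumption: an unobserved dinucleotide contributes 0.
                features.append(0.0)
                continue
            p_x = base_counts[x] / n_bases
            p_y = base_counts[y] / n_bases
            features.append(math.log2(p_xy / (p_x * p_y)))
    return features
```

Calling `mutual_information_features("ACGTACGT")` returns a list of length 16 ordered as AA, AC, AG, AT, CA, ..., TT. The counting takes a single pass over the sequence, which is consistent with the low time complexity claimed for alignment-free methods.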
This method combines ideas from information theory to capture more information about sequence effects while remaining simple to compute.

Statistical feature methods have a wide range of applications, for example distinguishing different functional regions of genes, sequence analysis, phylogenetic analysis, and gene classification. In this paper, the two new methods are applied mainly to phylogenetic analysis. Based on the features extracted by these methods, we analyze the similarity of species and construct phylogenetic trees from the distance matrix. We use the Neighbor.exe program of the PHYLIP software package to construct the phylogenetic trees, and we verify the applicability of the two methods by experiment.
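The distance-matrix step mentioned above can be sketched as follows. The abstract does not say which distance is used between feature vectors, so Euclidean distance is an assumption here, and the helper names `distance_matrix` and `to_phylip` are hypothetical. The output follows the square distance-matrix layout that PHYLIP programs read: the number of taxa on the first line, then one row per taxon with the name padded to 10 characters.

```python
import math


def distance_matrix(vectors):
    """Pairwise Euclidean distances between feature vectors (assumed metric)."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(vectors[i], vectors[j])
            dist[i][j] = dist[j][i] = d
    return dist


def to_phylip(names, dist):
    """Format a square distance matrix in PHYLIP's square layout."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, dist):
        # PHYLIP taxon names occupy exactly 10 characters; longer names
        # are truncated here, so they must stay unique after truncation.
        label = name[:10].ljust(10)
        lines.append(label + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"
```

In practice the feature vectors would come from one of the two extraction methods, one vector per species. The resulting text can be saved to a file and given to the PHYLIP Neighbor program as its input distance matrix.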
Keywords/Search Tags: DNA sequence, feature extraction, statistical features, information theory, phylogenetic tree