Research On Similarity Analysis Method For Gene Data

Posted on: 2009-08-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J W Luo
Full Text: PDF
GTID: 1100360242990744
Subject: Computer application technology
Abstract/Summary:
With the launch of the Human Genome Project and the many studies of biological gene sequences that followed, a growing volume of molecular sequence data has been produced. The scientific analysis and processing of these data has driven the development of bioinformatics. Sequence similarity analysis is the basis of bioinformatics: the information it yields can be used to infer gene structure, function, and evolutionary relationships, so methods for analyzing the similarity of gene data have become an important research topic in the field. Building on a survey of gene data representations and of current similarity analysis methods, this dissertation systematically studies cluster analysis methods, measures of sequence similarity, spatial representations of gene data, and similarity analysis methods based on those spatial representations. The main contributions are summarized below.

Firstly, a dynamic clustering approach for gene expression data based on multi-dimensional pseudo F-statistics is proposed. The algorithm dynamically adjusts the number of clusters and selects the best number according to the values of the multi-dimensional pseudo F-statistic. The experimental results show that the algorithm produces better clustering quality. Since missing data in gene microarrays have a serious impact on clustering results, the dissertation also applies the fuzzy C-means algorithm, which can properly handle overlap and correlation in the data, to the imputation of missing gene expression values, and proposes the imputation algorithm FCMimpute based on fuzzy C-means. The experimental results show that FCMimpute is a feasible and effective way to deal with missing values and that it performs particularly well.

Secondly, a clustering algorithm based on a dynamic similarity matrix is proposed. For DNA sequences, the dissertation analyzes gene sequence clustering based on the BAG clustering algorithm, specifies the initial cutoff value, the minimum-length threshold, and the criteria for splitting and merging clusters, and proposes a clustering algorithm based on the dynamic similarity matrix. The experimental results show that the method achieves a good rate of correct clustering.

Thirdly, we propose a novel method for sequence similarity analysis based on the relative frequency of dual nucleotides. Because similarity computations on DNA sequences are complex, the dissertation analyzes the neighboring dual nucleotides of a DNA sequence and proposes a sequence similarity analysis method based on their relative frequencies. The results show that this method expresses sequence similarity effectively with simple calculations.

Fourthly, we present a graphical representation of DNA sequences, define a sequence parameter, and propose an algorithm for constructing phylogenetic trees by hierarchical clustering. To address the degeneracy of existing graphical representations of DNA sequences, the dissertation presents a 3D curve representation, the N curve, proves that the N curve contains no loops and no degeneracy, and shows that it respects the symmetry of DNA sequences. It also defines a new matrix invariant, Z_inv. The experiments show that this parameter is easy to calculate and very close to λ. We then propose a phylogenetic tree construction algorithm based on hierarchical clustering; the experimental results show that the algorithm is effective.

Fifthly, we propose 2D, 3D, and 4D spatial representations of RNA secondary structure and carry out similarity analysis on RNA secondary structures. High complexity and degeneracy are major problems in representations of RNA secondary structure. The dissertation proposes 2D, 3D, and 4D spatial representations of RNA secondary structure and proves their validity. Similarity analysis of RNA secondary structures is then conducted using the matrix invariant, and the effectiveness of the method is demonstrated through comparison experiments on the similarity and dissimilarity of different RNA secondary structures.

Sixthly, we propose a 6D representation of protein sequences and define a distance measure of proteome similarity. The 6D representation is based on a classification of the 20 amino acids. We prove the validity and some numerical properties of the representation, propose a distance measure for protein sequence similarity, and construct a phylogenetic tree from the corresponding similarity matrix. Unlike most existing phylogeny construction methods, the proposed method does not require multiple alignments. The experimental results show that the method is effective.
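The dual-nucleotide comparison described in the third contribution can be illustrated with a minimal Python sketch. The abstract does not give the exact definition of "relative frequency" or the distance used, so this example makes two assumptions: the relative frequency of a dual nucleotide is its count divided by the total number of adjacent pairs, and two sequences are compared by the Euclidean distance between their 16-component frequency vectors. The function names are hypothetical and are not taken from the dissertation.

```python
from collections import Counter
from itertools import product
import math

# The 16 possible dual nucleotides (adjacent pairs) over the DNA alphabet.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_frequencies(seq):
    """Return the relative frequency of each adjacent nucleotide pair.

    Assumption: relative frequency = pair count / total valid pairs.
    Pairs containing symbols other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[d] for d in DINUCLEOTIDES)
    if total == 0:
        raise ValueError("sequence contains no valid dual nucleotides")
    return [counts[d] / total for d in DINUCLEOTIDES]


def dinucleotide_distance(seq_a, seq_b):
    """Euclidean distance between two frequency vectors (an assumed metric)."""
    freq_a = dinucleotide_frequencies(seq_a)
    freq_b = dinucleotide_frequencies(seq_b)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(freq_a, freq_b)))
```

Under these assumptions, a smaller distance indicates more similar dual-nucleotide composition, and the pairwise distances over a set of sequences could serve as input to a hierarchical clustering step. No sequence alignment is needed, which is consistent with the abstract's point that the calculation is simple.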
Keywords/Search Tags: similarity analysis, cluster analysis, spatial representation, phylogenetic tree, DNA sequence, RNA sequence, protein sequence, proteome