
Study On Several Problems In Biosequence Analysis

Posted on: 2012-03-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Yang
Full Text: PDF
GTID: 1100330332477487
Subject: Computer application technology
Abstract/Summary:
Massive amounts of DNA and protein sequence data have been generated by high-throughput experimental methods such as genome sequencing and DNA arrays. A main goal of bioinformatics is to understand and use this vast body of biological data effectively and to reveal the biological significance behind it. Sequence alignment and sequence clustering are important research areas of bioinformatics. This dissertation investigates multiple sequence alignment and the clustering of protein sequences. The main contents and contributions are summarized as follows:

1. We discuss and analyze the limitations of, and algorithmic improvements in, recent multiple sequence alignment tools. Multiple sequence alignment is one of the basic tools of bioinformatics and plays a vital role in structure modeling, functional site prediction, and phylogenetic analysis. We review methodologies and recent advances in multiple protein sequence alignment, e.g. speeding up the calculation of distances among sequences and employing iterative refinement and consistency-based scoring functions, with emphasis on the use of additional sequence and structural information to improve alignment quality. We then discuss how alignment quality is evaluated and the alignment speed of programs.

2. A method to improve the alignment quality of Kalign is proposed. Kalign is a widely used multiple sequence alignment method. However, its alignment quality is limited by inaccurate estimates of the distances between sequences. We introduce an algorithm that refines the alignment created by Kalign. First, we calculate pairwise sequence distances from the Kalign alignment. Then a new guide tree is built from the matrix of pairwise distances between all sequences, using the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) method. Finally, a new alignment is produced by a progressive alignment method. These steps are repeated until convergence or until a user-defined iteration limit is reached. We assess the method on the BAliBASE 3.0 alignment benchmark set. The results show that our algorithm achieves more accurate alignments than Kalign.

3. We propose a new fast algorithm for multiple sequence alignment. A method similar to BLAST is employed to estimate pairwise sequence distances, and the space-saving Myers and Miller algorithm is used so that large numbers of sequences can be aligned. We tested the method on alignments simulated with ROSE, varying the average sequence length and the number of sequences, and compared its running time, memory requirement, and alignment accuracy with those of the most commonly used algorithms. We demonstrate that our method is among the fastest and most memory-efficient programs, and that it becomes more accurate than other methods at high evolutionary distances.

4. A new similarity measure for protein sequences is proposed. The measure is based on the similarity of the conserved parts of two sequences and on their L-tuple frequency vectors, so it takes into account both the similar and the dissimilar subsequences of the two sequences. We then cluster sequences using the recently proposed Affinity Propagation (AP) algorithm. We tested the method extensively and compared its performance with four other methods on several datasets from the COG, G protein, CAZy, and SCOP databases. We observed that the new measure expresses the similarity between sequences more effectively, especially between protein sequences that are hard to align. Moreover, in our experiments the quality of the clusters was better than that of the other algorithms.

5. The limitations of the Affinity Propagation (AP) algorithm are analyzed for the case of clustering a randomly generated dataset. We first point out that no reasonable results can be obtained by using varied preferences as input. We then propose a post-processing method to improve the AP algorithm. The method uses the median of the input similarities as the shared preference value and then applies a post-processing phase, combining merging and reassignment strategies, to the results of the AP algorithm. We tested the method extensively and compared its performance with five other methods on several datasets from the COG database, SCOP, and the G-protein family. In our experiments, the quality of the clusters is better than that of the other methods, and in particular better than that of the original AP algorithm.
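
The shared-preference choice described for the AP algorithm can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the similarity function (negative squared distance between toy one-dimensional points), the sample data, and the names `similarity_matrix` and `with_median_preference` are all hypothetical. In AP, the diagonal of the similarity matrix holds each point's preference for becoming an exemplar, so a shared preference means writing one value on the whole diagonal.

```python
from statistics import median


def similarity_matrix(points):
    """Toy similarity (an assumption): negative squared distance between 1-D points."""
    return [[-(a - b) ** 2 for b in points] for a in points]


def with_median_preference(sim):
    """Return a copy of `sim` whose diagonal is the median off-diagonal similarity.

    Affinity propagation reads the diagonal entry s(k, k) as the preference
    of point k to become an exemplar; using one value for every k gives a
    shared preference.
    """
    n = len(sim)
    off_diagonal = [sim[i][j] for i in range(n) for j in range(n) if i != j]
    preference = median(off_diagonal)
    result = [row[:] for row in sim]  # copy so the input matrix is left untouched
    for k in range(n):
        result[k][k] = preference
    return result, preference


# Hypothetical data: two tight groups of three points each.
points = [0.0, 2.0, 4.0, 100.0, 102.0, 104.0]
sim, preference = with_median_preference(similarity_matrix(points))
print(preference)  # prints -9604.0
```

The resulting matrix could then be passed to any AP implementation that accepts precomputed similarities. The merging and reassignment post-processing phase is not shown here, because the abstract does not specify its details.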
Keywords/Search Tags:Similarity measure, Sequence clustering, Affinity Propagation, Multiple sequence alignment, Kalign