Font Size: a A A

The Study Of Seeded Sequence Alignment Method

Posted on:2009-12-25Degree:MasterType:Thesis
Country:ChinaCandidate:J L CaoFull Text:PDF
GTID:2178360272476538Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Bioinformatics is a new interdisciplinary subject, which integrates information science, mathematics and biological science. It is the core of biotechnology. Broadly speaking, bioinformatics engaged in genome research on the acquisition, processing, storage, distribution, analysis and interpretation related to biological information. Specifically, bioinformatics uses the genomic DNA sequence analysis as a basic step to find the coding genes representing the RNA and protein in genome sequence and clarify the essence of the information existing in a large number of non-coding regions, decipher the law of the genetic language hidden in the DNA at the same time. Further it summarizes the data about the transcription spectrum and protein spectrum related to releasing and regulation of genetic information so as to understand the law of metabolism, growth, differentiation, the evolution of. In addition, it makes use of information of coding region in genome to simulate the space structure of protein and predict protein function, and integrates such information with physiological and biochemical information of the organisms and life processes to clarify the molecular mechanism, finally realize molecule design of protein and nucleic acid, drug design and health care design of mediator.Sequence comparison is the most basic and important operation in bioinformatics. The information of functions, structure and evolution in the biological sequence can be found through sequence comparison. And the basic operation of sequence comparison is sequence alignment. Alignment of biological information is an important method for bioinformatics. It can find the conservative regions and differences between the specified sequence through comparison of DNA and protein sequences, and provide an important evidence to further explore the relations between the structures, functions and evolution. The theorial basis of alignment is evolution theory. If there is sufficient similarity between the two sequences, it is assumed that the two are evolved by genetic mutation of a common evolutionary ancestor such as replacement, missing of the residues, and sequence rearrangementEspecially, the problem of multiple sequence alignment is the main research direction in bioinformatics and it's the NP-hard problem, the solution is still difficult. Multiple sequence alignment is to compare a set of sequences. It plays an important role in identifying the local conservative regions having important functions and structures. At the same time it can also check the global similarity and evolutionary relationship in a sequence family and it is the basis of molecular biology.The goal of homology search is to find similar segments or local alignments between biological molecular sequences. Under the framework of match, mismatch, and gap scores, the Smith-Waterman alogorithm guarantees to find the global optimal solution. However, the running time of the Smith-Waterman algorithm is too large to be used on real genome data. There are some algorithms of alignment proposed to accelerate the homology search, and a lot of programs have been put forward, such as FASTA, BLAST, NUMmer, QUASAR, and PatternHunter. BLAST is a most widely used tool for searching local alignments. The basic idea of BLAST is that two sequences have locally consecutive matches, which can be found by using a length seed, and a reasonably good alognment can be then generated by extending these local matches.However, Li et al found that the homology search sensitivity can be largely improved if long "gapped" seeds are used instead of short "exact" seeds.PatternHunter is developed based on long "gapped" seeds. PatternHunter applies a dynamic programming (DP) based algorithm to compute the hit probability of a given seed. The difference between BLAST and Pattern Hunter algorithm is to find the "spaced"seed to replace the short and precise seed.The problems of computing the sensitivity of a given spaced seed and finding the most sensitive pattern have been studied for a long time. So far, the PH algorithm is still the best running time algorithm for calculating sensitivity of a given spaced seed. And most algorithms for finding the optimal seed can not give any guarantee on the performance.In this thesis, based on analyzing original Pattern Hunter algorithm, we study and realize improved algorithms. They can compute the sensitivity of the "spaced"seed in a random region.The first algorithm improves the PH algorithm when the size of seed-compatible-suffix set is large, instead of considering all possibel suffixes b∈B, we consider only a small subset of B, which will result in an algorithm with a better running time when |B| is large. The second algorithm improves PH algorithm when the region length is large. We select some seeds to check the correctness and efficiency of the algorithms.We further develop a Monte Carlo algorithm which can guarantee to quickly find the optimal seed with high probability. At the same time, we use improved Pattern Hunter algorithm to realize the Monte Carlo algorithm, it can find the optimal seed with high probability through adjusting the parameter.At last we point out the further direction: multiple seeds and running time due to the length of seeds.
Keywords/Search Tags:homology search, spaced seed, bioinformatics, multiple sequence alignment
PDF Full Text Request
Related items