Font Size: a A A

Research On Algorithms And Theories In Genomic Sequence Annotation

Posted on:2005-08-28Degree:DoctorType:Dissertation
Country:ChinaCandidate:D D LiFull Text:PDF
GTID:1100360155972195Subject:Control Science and Engineering
Abstract/Summary:PDF Full Text Request
In bioinformatics, which is currently one of the most active and important cross-subjects for the moment, the information technology is used to study the characters of DNA and protein and their interactions. A kernel task of bioinformatics is to annotate the immensurable sequenced DNA sequences with high quality by computer. The algorithms for DNA sequence annotation are the precondition and foundation to develop the DNA sequence auto annotation software and then accomplish the annotation task. This thesis studies some theories and algorithms in this subject, proposes five new algorithms for three basic types of problems in DNA sequence annotation, and does some testing for these algorithms. The followings are the central work and the creative achievements in this thesis:1. The research on gene finding algorithms. The purpose of gene finding algorithms is to identify the whole structure of gene in DNA sequences. In this thesis, the general hidden Markov model is used as the frame algorithm of DNA sequences annotation system. To realize this algorithm, the optimal algorithm of hidden Markov model is studied, then, a novel simplify algorithm, which is based on the decomposability of the states in gene finding problems, is proposed, and its computational complexity is linear with the length of the input sequences. According to the structure characters of prokaryotic gene and eukaryotic gene, two training sequence sets are used to statistic the parameters for each model. Then, two gene finding programs, GeneMiner (for prokaryotic gene finding) and HumGene (for eukaryotic gene finding), are realized. Some testing for respective testing sets are done, and the results show that their performances have obtained the upstream level of the similar software at present.2. The research on the motif finding algorithms. Motifs are the conservative segments in biology sequences, and they usually mean regions with important functions. The task of motif finding is to try to induce and find these motifs implicated in a set of candidate sequences. There are two types of motif finding algorithms, non-exhaust search algorithm and exhaust search algorithm. Non-exhaust search algorithms' computational complexity is small, but they can't find all motifs in DNA sequences surely. Exhaust search algorithms can find all these motifs in principle, but they always need huge computing time and memory. In fact, because of the exploded number of search paths, it is difficult for exhaust search algorithm to complete the search process in reasonable time. To resolve this problem, this thesis proposed a novel exhaust algorithm, Criterion Search Algorithm (CRISA). It can accomplish the exhaust search with less computation resource. This target is achieved based on the criterion to describe the relations among three similar segments, which is deduced in this thesis. Using this criterion as pruning rule, CRISA can reduce the search space effectively in the deeply first search process. Theory analysis on CRISA is done in this thesis, and the results show that under some rather loose conditions, the computational complexity of CRISA is a polynomial function of the length and the number of the inputsequences. Then, some tests using simulated data and true biology data are executed, and the results show that it is more efficient than other exhaust search algorithms obviously, and its search speed is even faster than many non-exhaust search algorithms.3. The research on the standard describing the strength of motifs. In motif finding problem, an important theory problem is to estimate how strong a motif is. At present, the probability that the (l,d) motif randomly appearing in each sample at least once is used as such a standard. However, it's not accord with the practical computing results exactly. In this thesis, a new standard based on pass probability is proposed. It can give a more accurate measurement to the strength of motifs, and also accords with practical compotation results. On the base of this measurement, this thesis calculates the quantitative criterion of the strength of a motif that CRISA can find it.4. The research on the weak motif and combinational motif finding algorithms. CRISA need more compute time for weaker motifs. So, for very weak motifs, CRISA needs a mass of time to find them. To resolve this problem, a novel algorithm, the Candidate Motif Validation Algorithm (CMVA), is proposed. It is used as a pre-process before CRISA. Using the segments with special characters, CMVA constructs a candidate motif set, and then validate all motifs in this set. After this operation, these segments are deleted from the segment relation graph so as not take into searching process. So, the nodes that need searching in the graph are reduced, and the efficiency of the computation is improved. Using this method, an algorithm for combinational motif finding problems is realized. The tests results also indicate this algorithm is efficient5. The research on repeated sequence finding algorithms. The repeated sequences are segments that occur in DNA sequences many times, and the study of the repeated sequences is a start point for understanding the non-coding regions in DNA sequences. So, an algorithm that can effectively identify the repeated sequences in DNA sequences is needed. In this thesis, a novel projection-assemble algorithm for interspersed repeated sequences finding problems is proposed. Its computational complexity is linear with the length of the input sequence. The tests in simulated data and real biology data indicate this algorithm is efficient. This algorithm is a new endeavor in interspersed repeated sequences finding problem.In a word, this thesis works on three types of fundamental algorithms in genomic sequences annotation, gene finding algorithms, motif finding algorithms, and repeated sequences finding algorithms. These researches would be helpful for developing more perfect and efficient genomic auto annotation system. This thesis has realized these algorithms in computer, and they are now used in DNA sequences data mining.
Keywords/Search Tags:bioinformatics, genomic sequence annotation, gene finding, motif finding, repeated sequence
PDF Full Text Request
Related items