Font Size: a A A

Study On Algorithms For Biological Sequence Motif Discovery

Posted on:2015-08-03Degree:DoctorType:Dissertation
Country:ChinaCandidate:Q YuFull Text:PDF
GTID:1220330464968870Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Motif discovery is to find over-represented sequence patterns in given sequences, playing an important role in locating significant sequence segments in biological sequences, such as transcription factor binding sites(TFBSs) and short linear motifs. Transcription factors control the transcription initiation and transcription efficiency of the target genes. The specific DNA sequences bound by transcription factors in the upstream region of genes are called TFBSs. Locating TFBSs accurately contributes to understanding the gene expression regulation mechanism. Linear motifs are segments in protein sequences with some specific functions, responsible for mediating protein-protein interactions and necessary for many regulatory processes, such as signal transduction, protein trafficking and post-translational modifications.The planted(l, d) motif search(PMS) is one of the most widely accepted fomulation for motif discovery. Solving PMS is a challenging task in both computer science and bioinformatics. Motifs are unknown and occur in sequences in a degenerated form. That is, motif instances(motif occurrences) are not exact copies of the motif, but differ from the motif in some positions. Comparing to DNA promoter sequences, protein sequences and Chromatin Immunoprecipitation-Sequencing(Ch IP-seq) sequences bring new challenges for solving PMS from the aspects of large alphabets and large data sets, respectively. Considering the characteristics of different biological sequences and the limitations of existing algorithms, this dissertation proposes new motif discovery algorithms in order to further improve the time performance and identification accuracy of motif discovery. The main content of this dissertation is summarized as follows.The first part focuses on the exact PMS algorithms for DNA promoter sequences. Existing exact algorithms have a high computation complexity or a high space complexity when identifying weak signal motifs. A new pattern-driven based exact algorithm named Pair Motif is proposed. How to generate candidate motifs from a pair of l-mers(l-length string) is analyzed. Some reference sequences are selected from input sequences by calculating the number of candidate motifs in advance, which effectively reduces the candidate motifs. Two rules for filtering l-mers to be scanned are designed, which contributes to accelerating motif verification. Comparisons with several recently proposed algorithms show that Pair Motif requires less storage space and runs faster on most PMS instances. Particularly, among thealgorithms compared, only Pair Motif can solve the instance(27, 9) within 10 hours.The second part focuses on the approximate PMS algorithms for DNA promoter sequences. Existing algorithms either output the exact results with a high computational cost or accomplish the computation in a short time but very often fall into a local optimum. A new pattern-driven based approximate algorithm named Pair Motif+ is proposed. At first, some pairs of l-mers are extracted from input sequences according to probabilistic analysis and statistical method so that one or more pairs of motif instances are included in them. Then an approximate strategy for refining pairs of l-mers with high accuracy is adopted in order to avoid the verification of most candidate motifs. Pair Motif+ can solve various PMS instances within an hour on a PC, and has a better identification accuracy than the compared algorithms MEME, Align ACE and VINE.The third part focuses on the motif stem search problem for large alphabets(protein sequences). Current stem search algorithms have several notable limitations: motif stems cannot be represented precisely, redundant wildcards are placed, and execution efficiency is low. Motif stem representation is built more precisely by using regular expressions. A method is given for generating all possible motif stems without redundant wildcards. Based on the stem representation and stem generation, an efficient exact algorithm named Stem Finder is proposed. Compared with previous algorithms, Stem Finder runs much faster and reports fewer stems that can cover all(l, d) motifs.The fourth part focuses on the motif discovery problem for large DNA data sets(Ch IP-seq data sets). It is difficult for current algorithms to efficiently process the full-size Ch IP-seq data sets. A new word counting based motif discovery algorithm named MCES is proposed, by mining and combining substrings with high occurrence frequencies. To handle larger data sets, a Map Reduce-based distributed strategy to mine substrings is designed. MCES is able to efficiently and effectively process thousands to millions of input sequences, and runs faster than the state-of-the-art(l, d) motif discovery algorithms. Also, MCES is able to identify motifs without known lengths, and has a better identification accuracy than the competing algorithm Cis Finder.
Keywords/Search Tags:motif discovery, motif stem search, pattern-driven, Ch IP-seq, Map Reduce
PDF Full Text Request
Related items