Font Size: a A A

LM-Suffix: Research On Gene Sequence Index Structure Based On Suffix Tree

Posted on:2015-06-05Degree:MasterType:Thesis
Country:ChinaCandidate:F Y W HuangFull Text:PDF
GTID:2270330431469108Subject:Systems analysis and integration
Abstract/Summary:PDF Full Text Request
With the deepening of the bioinformatics, a large number of biological information has been produced and is still being produced in a steady stream. If we can quicken to deal with this information, we can better reveal the mysteries of life.For these reasons, the researchers make use of index to improve the processing of sequences.For smaller sequences, Suffix treeis a good index structure, but there exists the so-called "memory bottleneck", so it not suit large sequences. The suffix array is another good structure, as it needs less store space than a suffix tree, but its performance is far lower than a suffix tree; Approaches based on q-gram or q-sample can be used for fast matching, but for the biologysequences of low similarity, it can do nothing.In this paper, based on the deep study of the classical suffix tree, we present LM-Suffix based on a suffix tree index structure. LM-Suffix divides the long biologysequences into some short biologysequences, which solve the "memory bottleneck" that the classical suffix tree exists, and then in order to construct a suffix tree, we sort the short biologysequences by alphabetical order and according to the biggest publicprefix of the two adjacentbiologysequences confirm the branch point. During the constructing a suffix tree, because LM-Suffix is parting a suffix tree, so the classical suffix treesearching technology is not suitable for the algorithm.In this paper, based on the classical suffix treesearching technology, we present a search scheme which is suitable for LM-Suffix, which increase child suffix tree traversal.Finally, we show experimentally that the LM-Suffix can be built effectively with sequences whose index size exceeds the available RAM, and its performance is better, which solve the so-called "memory bottleneck". In addition, We also implement the searching technology using the LM-Suffix.
Keywords/Search Tags:biology sequences, suffix tree, index structure, LM-Suffix
PDF Full Text Request
Related items