Font Size: a A A

SIMD Based String Matching Algorithm For Biological Sequence

Posted on:2016-07-07Degree:MasterType:Thesis
Country:ChinaCandidate:L WangFull Text:PDF
GTID:2180330476954992Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
String matching is an important problem that has been thoroughly studied in computer science, with broad applications in bioinformatics as well as natural language processing, information retrieval, etc. For example, it is used to find similar sequence or locate a segment in a long sequence. Currently, several string matching algorithms are used on biological sequences, such as tvsbs, graspm etc. With the rapid development of high-throughput sequencing technologies, it has become easier and cheaper to obtain vast quantities of biological sequences, accordingly posing great challenges in searching for a specific pattern within a large volume of biological sequences. Therefore, it is of fundamental importance to design more effective string matching algorithms to address this challenge.There are several algorithms that have developed for exact string matching in the past years.Among them, an algorithm called SSECP begin to use exact packed string matching technique, in which multiple characters are packed into one block-character, so that the characters can be compared in bulk rather than indi-vidually.Here we present Improved Exact Packed String Matching(IEPSM), an exact packed string matching algorithm that is dedicated for biological sequences. IEPSM features optimized word-size packed strings and adopts a big hash value to decrease byte-by-byte comparisons. And IEPSM computes fingerprint values by a hash function using Single Instruction Multiple Data instructions, which sup-ports parallel execution of some operations via a set of special instructions. And we take some experiments to obtain the optimal shift distance.Comparative results on multiple empirical datasets show that IEPSM achieves better efficiency by comparison with existing algorithms. Thus, IEPSM is of broad utility for searching a specific pattern in the era of big biological data.
Keywords/Search Tags:string matching, bioinformatics, Intel SSE instruction
PDF Full Text Request
Related items