Font Size: a A A

Embedding-based subsequence matching in large sequence databases

Posted on:2011-03-12Degree:Ph.DType:Dissertation
University:Boston UniversityCandidate:Papapetrou, PanagiotisFull Text:PDF
GTID:1448390002453221Subject:Computer Science
Abstract/Summary:
Sequential data, such as time series and categorical sequences, naturally appear in a wide variety of domains including financial and scientific data, human activity, biological sequences, etc. In such domains large databases of sequences are used as knowledge repositories. Information retrieval from such repositories is challenging, due to the large amount of data that needs to be searched.;Our attention is focused on subsequence matching methods that employ dynamic programming-based distance measures. Such approaches are robust to misalignments and time warps and are widely used for time series and DNA matching. Three methods are proposed for efficient subsequence matching in large sequence databases.;The first method works for time series databases and the Dynamic Time Warping (DTW) distance. It converts subsequence matching to vector matching using an embedding that maps each database time series into a sequence of vectors. The embedding is computed by applying the full DTW distance between each reference and database time series. At query time, the embedding of the query time series is computed in a similar manner. Relatively few areas of interest are identified by performing vector comparisons, and are then fully explored using the exact DTW distance. The second method defines a similar type of embedding that stores additional information into the vector representation and significantly improves the efficiency of subsequence matching under the constrained Dynamic Time Warping (cDTW) distance.;The third method speeds up retrieval of optimal subsequence matches in string databases, under the Edit Distance and the Smith-Waterman similarity measure. Filtering of candidate matches is performed using precomputed alignment scores between the database sequence and a set of fixed-length reference sequences. At query time, the query sequence is partitioned into segments of the same length as the reference sequences. For each of those segments, the alignment scores between the segment and the reference sequences are used to efficiently identify a relatively small number of candidate matches. Experiments show that the proposed method outperforms BLAST by over an order of magnitude in retrieval runtime for large queries (up to 10,000 bases) and similarity levels of up to 15%.
Keywords/Search Tags:Time, Subsequence matching, Large, Data, Embedding
Related items