Embedding-based subsequence matching in large sequence databases

Posted on:2011-03-12

Degree:Ph.D

Type:Dissertation

University:Boston University

Candidate:Papapetrou, Panagiotis

Full Text:PDF

GTID:1448390002453221

Subject:Computer Science

Abstract/Summary:

Sequential data, such as time series and categorical sequences, naturally appear in a wide variety of domains including financial and scientific data, human activity, biological sequences, etc. In such domains large databases of sequences are used as knowledge repositories. Information retrieval from such repositories is challenging, due to the large amount of data that needs to be searched.;Our attention is focused on subsequence matching methods that employ dynamic programming-based distance measures. Such approaches are robust to misalignments and time warps and are widely used for time series and DNA matching. Three methods are proposed for efficient subsequence matching in large sequence databases.;The first method works for time series databases and the Dynamic Time Warping (DTW) distance. It converts subsequence matching to vector matching using an embedding that maps each database time series into a sequence of vectors. The embedding is computed by applying the full DTW distance between each reference and database time series. At query time, the embedding of the query time series is computed in a similar manner. Relatively few areas of interest are identified by performing vector comparisons, and are then fully explored using the exact DTW distance. The second method defines a similar type of embedding that stores additional information into the vector representation and significantly improves the efficiency of subsequence matching under the constrained Dynamic Time Warping (cDTW) distance.;The third method speeds up retrieval of optimal subsequence matches in string databases, under the Edit Distance and the Smith-Waterman similarity measure. Filtering of candidate matches is performed using precomputed alignment scores between the database sequence and a set of fixed-length reference sequences. At query time, the query sequence is partitioned into segments of the same length as the reference sequences. For each of those segments, the alignment scores between the segment and the reference sequences are used to efficiently identify a relatively small number of candidate matches. Experiments show that the proposed method outperforms BLAST by over an order of magnitude in retrieval runtime for large queries (up to 10,000 bases) and similarity levels of up to 15%.

Keywords/Search Tags:

Time, Subsequence matching, Large, Data, Embedding

Related items

1	Efficient subsequence matching with LCS
2	Optimal subsequence bijection and classification of imbalanced data sets
3	Efficient Complex Pattern Matching Approaches On Large-scale Text Collections
4	Research Of The Algorithm Of Similarity Query And Classification With Shapelets Based On DTW Distance
5	Research Of Detection Algorithm For Abnormal Subsequence Of Single Time Series
6	Research On Fast Query Technology Of Similar Subsequences In Time Series Data
7	The Research On Algorithms For The Longest Common Subsequence Problem And Variants
8	Matrix Embedding For Large Payloads And Its Applications In Steganography
9	Research On Embedding Algorithm For Large User Digital Fingerprinting System
10	Research On Key Technologies In Data Stream Analysis