Design And Optimization For A Biological Sequence Matching System Supporting Scoring Matrices

Posted on:2016-05-29

Degree:Master

Type:Thesis

Country:China

Candidate:D J Wei

Full Text:PDF

GTID:2370330542957336

Subject:Computer technology

Abstract/Summary:

PDF Full Text Request

With the development of gene sequencing technology,human can get a mass of biological sequences every day.One of the most important subjects in bioinformatics is to match the biological sequence with the biological motif.Through comparision and analysis,the biological information hidden in the vast amount of sequences can be mined,which helps solve the problem in biology,healthcare and other fields.As a popular choice for modeling signals or mofits in biological sequences,position specific scoring matrices have been widely used in several different scenarios over biological sequence analysis.Finding the matches of scoring matrices in a set of sequences has become the key technology of analysis and interpretion of biological sequences.At present,most widely used biological sequence matching systems rely on the brute-force sliding window approach to evaluate segmental score for each possible starting position and insert the segments that meet the threshold into the result set.This can be a very time expensive routine.In recent years,lots of approaches are proposed to speed up scoring matrices matching based on indexing data structures.The idea is to preprocess the input biological sequences to build an index such as suffix tree or suffix array that allows skipping uninteresting parts that share a common prefix.The drawback of using the suffix tree and suffix array data structure is that it is expensive to store,which severely limits its applicability.Based on the existing work,the thesis firstly presents a new algorithm,called CSAsearch,to efficiently find matches of scoring matrices in large databases.By compressing the suffix array used in ESAsearch,this algorithm is more space-efficient.With respect to the implementation on the suffix array this algorithm does not lose efficiency and is more capable of processing large scale biological sequences.Secondly,by converting the scoring matrices matching problem into the classical string pattern matching problem,a BWT_M algorithm based on BWT self-index structure is proposed to generate and verify all the candidate sequences that match the scoring matrices.When the threshold is small,the matching performace of this algorithm outperforms the existing algorithms.Finally,combined with optimized technologies,the thesis designs a biological sequence matching system that supports scoring matrices and conductes large numbers of experiments on it.By analyzing the result of the experiments,the thesis adjusts structural parameters of index,shows the matching performace of different algorithms under different influencing factors and demonstrates that the designed system is space saving and capable of finding the matches of scoring matrices efficiently.

Keywords/Search Tags:

scoring matrix, biological sequence, pattern matching, suffix array, BWT

PDF Full Text Request

Related items

1	Research Of Genome Data Compression Algorithm Based On Reference Sequence And Suffix Array
2	Studied On Gene Sequence Alignment Based On Mixed Suffix Tree And Suffix Array
3	Ultra-large Multiple Sequence Alignment Based On Distributed Computing
4	LM-Suffix: Research On Gene Sequence Index Structure Based On Suffix Tree
5	Negative Sequence Pattern Mining Technology And Application Of Biological Information
6	Alignment-free Sequence Similarity Analysis And Clustering Algorithms On Biological Sequences
7	Sequence Assembly Algorithms For Next-generation Sequencing Technology Research
8	Parallel Optimization Design And Implementation Of Biological Sequence Alignment Algorithm
9	The Design And Implementation Of A Multiple Sequence Alignment Algorithm Based On Suffix Tree Strategy
10	DNA Multiple Sequence Alignment And Similarity Analysis Based On Pattern Matching