Font Size: a A A

Design And Optimization For A Biological Sequence Matching System Supporting Scoring Matrices

Posted on:2016-05-29Degree:MasterType:Thesis
Country:ChinaCandidate:D J WeiFull Text:PDF
GTID:2370330542957336Subject:Computer technology
Abstract/Summary:PDF Full Text Request
With the development of gene sequencing technology,human can get a mass of biological sequences every day.One of the most important subjects in bioinformatics is to match the biological sequence with the biological motif.Through comparision and analysis,the biological information hidden in the vast amount of sequences can be mined,which helps solve the problem in biology,healthcare and other fields.As a popular choice for modeling signals or mofits in biological sequences,position specific scoring matrices have been widely used in several different scenarios over biological sequence analysis.Finding the matches of scoring matrices in a set of sequences has become the key technology of analysis and interpretion of biological sequences.At present,most widely used biological sequence matching systems rely on the brute-force sliding window approach to evaluate segmental score for each possible starting position and insert the segments that meet the threshold into the result set.This can be a very time expensive routine.In recent years,lots of approaches are proposed to speed up scoring matrices matching based on indexing data structures.The idea is to preprocess the input biological sequences to build an index such as suffix tree or suffix array that allows skipping uninteresting parts that share a common prefix.The drawback of using the suffix tree and suffix array data structure is that it is expensive to store,which severely limits its applicability.Based on the existing work,the thesis firstly presents a new algorithm,called CSAsearch,to efficiently find matches of scoring matrices in large databases.By compressing the suffix array used in ESAsearch,this algorithm is more space-efficient.With respect to the implementation on the suffix array this algorithm does not lose efficiency and is more capable of processing large scale biological sequences.Secondly,by converting the scoring matrices matching problem into the classical string pattern matching problem,a BWT_M algorithm based on BWT self-index structure is proposed to generate and verify all the candidate sequences that match the scoring matrices.When the threshold is small,the matching performace of this algorithm outperforms the existing algorithms.Finally,combined with optimized technologies,the thesis designs a biological sequence matching system that supports scoring matrices and conductes large numbers of experiments on it.By analyzing the result of the experiments,the thesis adjusts structural parameters of index,shows the matching performace of different algorithms under different influencing factors and demonstrates that the designed system is space saving and capable of finding the matches of scoring matrices efficiently.
Keywords/Search Tags:scoring matrix, biological sequence, pattern matching, suffix array, BWT
PDF Full Text Request
Related items