Font Size: a A A

Research Of Large-scale Biological Sequence Algorithm Based On The Sliding Window And The Keyword Tree

Posted on:2014-03-06Degree:MasterType:Thesis
Country:ChinaCandidate:Y SunFull Text:PDF
GTID:2250330398982426Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
With the implement of human genome project successfully and the increasing quantity of biological data, Bioinformatics has been emerged rapidly. Nowadays, Multiple Sequence Alignment (MSA) is a difficult yet important problem for bioinformatics research, the application range of which is very wide, especially in Gene Identification and Structure Prediction, etc. In most cases, large-scale biological sequence data with high similarity have to be analyzed. However, so far, there is not some satisfying methods and algorithms that can meet all needs of researchers in various fields, due to biological characteristics itself and massive biological data obtained. And it is more and more disappointing that serial algorithm is inefficient with biological data. The research will be focusing on how to apply the parallel computing in Multiple Sequence Alignment and its parallel processing strategy. Meanwhile, based on that, we propose a novel MSA too, namely P-MA.Firstly, We will be a detailed analysis on a detailed analysis of the MSA, which are widely researched and applied in recent years. Star-Align algorithm is often utilized to deal with the large-scale data, for it is favourable in optimization and sensitivity. However, square time complexity is a bottleneck for further research. People start to seek a balance that can improve the speed with nearly optimal well. Introduce parallel processing to the algorithm of Multiple Sequence alignment based on the improved Star-Align algorithm. Instead of utilizing dynamic programming, we adopt the improved center star alignment algorithm, MA, which are more suitable for sequences with high similarity. What’s more, the time complexity of the algorithm can reduce almost to linearity.A novel method for the MSA problem has been proposed, which employs the keyword tree and the sliding window to match a set of substrings and the rest regions are aligned by the dynamic programming. The method provides the dynamic adjustment mechanism for the sliding window size and the step length. The self-adaptive parameters play a extremely important part for improving the performance of the method. Experimental results show that the proposed method is computational efficient and can obtain good performance.Nevertheless, it still’ relatively need lots of the running time. Especially, In the reference sequence selection for many sequences actually, most of which are idle and don’t do any things. In addition, it is the high computational complexity of MSA. Based on those, we study the parallel processing strategy based on MA and propose P-MA that can reasonably assign the task and handle the idle sequences in the reference sequence selection. The processing of those sequences only processed by the serial processing before becomes the parallel processing on different processors. After the reference sequence selection, the algorithm can continue to align. This method is also complied with the sliding adjustment mechanism and there is a sepcial emphasis on rationality in the allocation. For parallel comuting will bring the additional overhead, the adjustment mechanism and the allocation are so important for P-MA. For every step of parallel processing strategy for the algorithm in the research, we build P-MA, change parameters, and discuss the serial processing in the pair-wise alignment stage and the feasibility of the parallel, sequentially.Algorithm the purpose of the improved algorithm is to speed up the operating, it must guarantee its accuracy within an acceptable range. The algorithm never lets the comparison results lost the research value and the biological significance. The algorithm can reduce the time complexity of problem to a linear level. After the reference sequence selection, finally, we can get the results of MSA with the dynamic programming. The main objective of this paper is to improve the MA. It makes full use of the idle sequences. Experimental results show that the DNA Multiple Sequence star alignment with parallelism is effective and feasible, especially for the sequences with the high similarity. In addition, through a series of experiments on biological sequence data, it shows that P-MA algorithm is just slightly better than OA in precision. However, the processing speeds of P-MA algorithm are much higher than OA and MA, especially in condition of longer input sequence and more data. This is not only to ensure the accuracy of the results, but also greatly improved operational efficiency of the algorithm.
Keywords/Search Tags:Multiple Sequence Alignment, Star-Align Algorithm, Parallel Computing, Sliding Window, Keyword Tree
PDF Full Text Request
Related items