Analysis And Modeling Of A Kind Of Key Problems In Biological Sequence

Posted on:2022-05-12

Degree:Master

Type:Thesis

Country:China

Candidate:Q J Hao

Full Text:PDF

GTID:2480306605473174

Subject:Master of Engineering

Abstract/Summary:

PDF Full Text Request

Solving the longest common subsequence(MLCS)of multiple biological sequences(such as DNA sequences and protein sequences)is an NP-Hard problem,and has important applications in disease diagnosis,cancer recognition,text comparison,data mining,etc.The existing mainstream algorithms mainly fall into two categories.One is the recursive algorithm based on dynamic programming.The time complexity of this kind of algorithms increases exponentially with the increase in the number and length of sequences.The other is the algorithms based on the directed acyclic graph(DAG)model.The disadvantage of this type of algorithms is that the built DAG is large in scale,and the time cost and storage overhead are very huge,and it is impossible to solve large-scale MLCS problems with limited resources.In order to overcome these shortcomings,this paper proposes two new efficient algorithms based on the directed acyclic graph DAG model as follows:1.Propose a DAG model based algorithm with path selection and layer updating(PSLU＿MLCS).This algorithm constructs a new DAG graph model.The new DAG scale is smaller than the existing DAG scale.Any longest path from the start point to the end of the new DAG corresponds to an MLCS,and all the longest paths from the start point to the end point are all MLCSs.On the contrary,all MLCSs are all the longest paths of the new DAG from the start point to the end point.The method includes three key strategies:(1)A path selection strategy is designed.In this strategy,a path length upper bound estimation function is proposed to estimate the upper bound value of the path length of each path in the DAG.According to the upper bound value,you can quickly find a longer path from the start point to the end point,and use the length of the path as the lower bound of the MLCS.(2)A path expansion strategy is designed.When we expand the remaining part of a path on DAG,we compare the upper bound value of the length of the remaining part of a path with the lower bound value of MLCS.If the upper bound value does not exceed the lower bound value,there is no need to put this path into the DAG.As a result,the DAG scale can be greatly reduced.(3)A layer updating strategy is designed.When extending the path through the point p,we update the number of layers of the node p.In this way,after the DAG is built,the number of layers of all nodes is the final correct number of layers,and no additional layering algorithm is needed,which reduces the time for seeking MLCS.The scale of the DAG created by the PSLU＿MLCS algorithm is greatly reduced.Experimental results show that this algorithm outperforms the current mainstream algorithms and is very suitable for large-scale MLCS problems.2.Propose a DAG model based algorithm with double step sizes(DS＿MLCS).This algorithm improves the algorithm PSLU＿MLCS in the previous part,and modifies the original DAG construction method based on single step size by constructing the DAG based on double step sizes.The constructed DAG is smaller than the DAG constructed by the previous algorithm.The method includes three key strategies:(1)It creates a double-step DAG model.The double-step DAG model constructs DAG by adding two consecutive characters as a node(i.e.,each node in the constructing DAG represents two consecutive characters,while each node in the existing DAG only represents one character),and a double-step successor table is designed to quickly find the position of the successor nodes.In this way,the DAG scale will be further smaller.(2)A shared node strategy is proposed.In the construction of a double-step DAG,all pairs of two consecutive characters with same coordinate of the second characters are divided into the same type,and only one node is used in constructing DAG to represent this type of all two consecutive characters,and only one edge is connected to its precursor node,which can reduce a large number of nodes and edges,and further reduce the scale of the DAG.(3)A new strategy for searching the longest path is proposed.In the new DAG,each node saves a lot of two consecutive character pairs,and every time a longest path from the start point to the end point is found,multiple MLCSs can be found.It further reduces the time to search the longest path.The DS＿MLCS algorithm is a brand-new method of creating DAG.Any existing algorithm based on the DAG model can be optimized by this method.Experimental results show that this algorithm is ahead of mainstream algorithms with an average performance increase of 30%,and is more applicable to solve the large scale MLCS problems.

Keywords/Search Tags:

Sequence Alignment, Longest Common Subsequence, Upper and Lower Bound Estimation, Path Selection, Shared Node, Layer Update, Double Step Successor Table

PDF Full Text Request

Related items

1	Research Of High Performance Bit-parallel Longest Common Subsequence Algorithm
2	High Performance Algorithms And Models For Large-scale Biological Sequence Analysis
3	Estimation Of The Upper And Lower Bound For The Solutions Of The Unified Algebraic Lyapunov Equations Based On The Delta Domain And Their Numerical Algorithms
4	Graphical Representation Of Dna Sequences And Their Similarity Analysis
5	Graphical Representation Of Protein Sequence And Its Similarity Analysis
6	The General Relationship Is Quasi-fuzzy Rough Set Model
7	Estimation Of Upper And Lower Bounds For Solutions Of The Unified Algebraic Riccati Matrix Equations Based On Delta Operator
8	Research On The Railway Alignment Selection Based On The Optimal Path Technology Of GIS
9	A Design Of Short Gene Sequence Alignment Acceleration System Based On High Performance Hash Table
10	Research On Efficient Time Series Clustering Algorithms