Font Size: a A A

Two MLCS Branch-and-Bound Algorithms Based On Memory Saving Or Fast Construction Of Graph

Posted on:2023-04-13Degree:MasterType:Thesis
Country:ChinaCandidate:B T JiangFull Text:PDF
GTID:2558306905499784Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
In our life,we receive many kinds of information anytime and anywhere,and information can usually be abstracted into sequences of finite characters.Taking DNA as an example,it is a sequence organically composed of four bases A,C,G,and T.Finding the longest common subsequences of multiple sequences(i.e.,the MLCS problem)is one of the most important research interests in sequence mining,which has a wide range of applications in bioinformatics,pattern recognition,text analysis and other fields.However,in the era of big data,the number of sequences in MLCS problem is increasing and the length of them is getting longer and longer.Many algorithms cannot complete searching the solution in an acceptable time,and run out of memory.So the need for more efficient and practical algorithms is increasingly urgent.The current research status of MLCS problem at home and abroad is first summarized,and then two new algorithms for overcoming the shortcomings of the existing algorithms are proposed.(1)A branch and bound method that saves storage space is proposed.First,a more reasonable lower bound estimation strategy than the existing ones is designed.To construct a local directed acyclic graph(DAG graph)for the sequence,the size and distribution of the matching point coordinates are both considered when retaining nodes in each layer,so that the real common subsequence obtained is as long as possible.Then,a simpler upper bound estimation strategy is devised.A new concept "boundary point" is defined,and a new upper bound calculation formula is derived by using the properties of the boundary point.So the upper bound value of the length of the branch where it is located can be quickly estimated for the obtained new matching point,and it is added to the graph only when its upper bound value is not less than the lower bound value.In addition,a storage-saving graph model is designed.We let the DAG graph only retain nodes with in-degree 0 at any time,and each node stores the string set corresponding to all paths from the source point to the node itself.Based on the standard data set,the newly proposed algorithm is compared with several representative algorithms,and it is proved that the algorithm is effective and efficient.(2)A branch and bound method with rapidly constructing DAG graphs is designed.Through the analysis of the first proposed algorithm,we find it has some deficiencies.Because of its layer-by-layer composition,the upper bound is estimated for the new nodes generated by each layer.Many nodes will appear in different layers for many times,so the upper bound calculation will be repeated for a large number of nodes.Aiming at tackling this issue,a realtime updated lower bound estimation strategy and a new upper bound estimation strategy are designed,and the algorithm framework is revised.First,a complete DAG graph is quickly constructed from the source point to ensure that there are no duplicate points in the graph.Then,we perform depth-first backtracking from the end point to the source point,estimate the upper bound of the passed nodes during backtracking,delete redundant points and branches where redundant points are located in time,and update the lower bound in real time according to the length of the resulted sequence obtained by backtracking.The realtime updated lower bound estimation strategy further improves the pruning ability of the algorithm.When all branches are backtracked,all longest common subsequences will be obtained.In addition,when the size of the graph is too large,the data stored in the system will exceed the memory.This method adds a physical method.When composing the graph,the memory and the external memory are cooperated,and some data structures are stored in the external memory,so that the system can save more Large DAG graph.Comparing the newly proposed algorithm with several representative algorithms on standard datasets,it indicates that the newly proposed algorithm can handle the MLCS problem of larger scale sequences.
Keywords/Search Tags:branch and bound, longest common subsequence, directed acyclic graph, local expansion, backtrack
PDF Full Text Request
Related items