Font Size: a A A

Algorithms For Genomic Scaffold Filling Problem

Posted on:2022-02-12Degree:DoctorType:Dissertation
Country:ChinaCandidate:J J MaFull Text:PDF
GTID:1480306608479964Subject:Computer Software and Application of Computer
Abstract/Summary:PDF Full Text Request
Computational biology is a subject that uses computer technology to store,analyze,model and calculate biological data,and obtain biological information from it.It is divided into computational genomics,computational proteomics,computational transcriptomics,etc.Computational genomics researchers usually extract the biological problems in genomics as combinatorial optimization problems,and then apply the methods of solving combinatorial optimization problems to it.Combinatorial optimization problems belong to the category of optimization problems,that is,the optimal solution is obtained from the feasible solution set of combinatorial problems.However,these optimization problems are often difficult to solve.Therefore,most of the problems in computational genomics are NP-Hard.The contents that computational genomics study generally include genome sequencing technology,assembly algorithms,comparison between genomes,genome recombination and so on.Scaffold filling is also a hot new field in computational genomics research.The problem of Scaffold filling can be divided into two types:with and without duplicated genes.The problem of genomical scaffold filling without duplicated genes is relatively simple and polynomially computable.But filling the genome scaffold with duplicate genes is harder to figure out,so most of the problem of Scaffold filling with duplicated genes has been shown to be NP-hard,that is,they have no polynomial time algorithm.Therefore,most researchers devote themselves to the design and analysis of the polynomial time approximation algorithm of the problem,hoping to obtain a better feasible solution of the problem in acceptable time.The problem of Scaffold filling is divided into one-sided and two-sided scaffold filling according to whether the reference genome is given.If one of the two given scaffolds is a complete reference genome,that is,the entire genome of a neighboring species is present as a reference,it is called one-sided scaffold filling.Otherwise,it is called two-sided scaffold filling.There is no polynomial and accurate algorithm for either one-sided or two-sided genome scaffold filling if duplicated genes exist.Under the trend that genome assembly is not completely dependent on whole genome sequencing,the goal of scaffold filling is to fill the missing genes into the incomplete genome sequence to obtain a replacement genome that is close to the original genome,so as to improve the result of genome assembly effectively and reduce the cost of whole genome sequencing.There are many criteria to measure the similarity between the filled scaffold and the original genome,among which the typical ones are genome rearrangement distance,exemplar breakpoint distance,breakpoint number,common partition,common adjacency numbere,etc.Due to the simplicity of calculation of breakpoint number and common adjacency number,most models of scaffold filling problem in recent years aim at minimizing breakpoint number and maximizing the number of adjacencies.In this thesis,we investigate the scaffold filling problem with the goal of maximizing the adjacency number,and analyze the optimal solution structure precisely.By constructing the conflict graph,the problem of finding inserted strings is transformed into the problem of finding the maximum independent set in the conflict graph.Thus,a polynomial approximation algorithm is designed,and the approximation ratio is improved to 1.4+?.In addition,we first introduce duo-preservation to the scaffold filling problem,and propose a new model of scaffold filling problem,which is based on block-matching to maximize duo-preservations.Through analyzing a large number of examples,we analyze the complexity and inapproximability of the problem,and devise a linear exact algorithm for the special case and three polynomial approximation algorithms for general case.The main contributions of this thesis are as follows:1.Accurately analyze the optimal solution of the scaffold filling problem with the goal of maximizing the number of adjacenciesIn this thesis,by analyzing a large number of examples,we study the optimal solution of the scaffold filling problem based on maximum matching model and aiming at maximizing the number of adjacencies.By finding the correlation between the insertion strings,we introduce the concepts of relevant graph and path component,which represent the strings that are composed of several missing genes and must be inserted as a whole as a path component.If the two path components have an influence on each other,it means that the two missing gene string sets affect each other when inserted.Therefore,by merging the two path components into one using certain methods,we can merge the two inserted string sets into one,and insert them as a whole.All the strings are inserted into the corresponding position at the same time,so that more adjacencies can be obtained.Through strict proof,we obtain a special optimal solution with the least number of path components.By analyzing the structure of the optimal solution,we obtain the upper bound of the adjacency number in the optimal solution.This work is the basis for the design of following approximation algorithms.2.Design a(1.4+?)-approximation algorithm for two-sided scaffold filling to maximize the number of adjacenciesThe problem of scaffold filling with the goal of maximizing the number of adjacencies is NP-Hard.At present,the approximation ratio of the best approximation algorithm for the two-sided scaffold filling problem is 1.5.In this thesis,we devise a polynomial-time approximation algorithm with an approximate ratio of 1.4+? for the two-sided scaffold filling problem based on maximum matching model to maximize the number of adjacencies.By constructing the conflict graph with special properties,the approximation algorithm proves that the conflict graph is a k-Claw Free graph.The problem of finding the inserted strings is transformed into the problem of finding the maximum independent set in the k-Claw Free graph.Therefore,we use the polynomial time algorithm designed by Halldorsson to find independent sets and obtain enough inserted strings.Through strict proof and analysis,we prove the approximation ratio of the proposed algorithm is 1.4+?,where ? is a constant greater than 0.3.Design an exact algorithm for the special case of the scaffold filling problem to maximize the number of increased duo-preservationsSince the model based on maximum matching to maximize the adjacency number for scaffold filling has been difficult to meet the actual needs of biological data analysis.we introduce duo-preservation to the scaffold filling problem,and propose a new model for scaffold filling,which is based on block-matching and aims to maximize the duopreservations.Using duo-preservation number to describe the similarity of the filled sequences can make the results more accurate.In this thesis,we study the special case for scaffold filling problem to maximize the duo-preservations based on blockmatching model(abbr.OSSF-MIDP).By analyzing a large number of examples,we find a special case of OSSF-MIDP problem,which is called Sim-OSSF-MIDP problem.In the Sim-OSSF-MIDP problem,there are no mismatched genes in the given scaffolds.We devise an exact algorithm with time complexity of O(n).It is proved that the SimOSSF-MIDP problem is polynomially computable and even linearly computable.4.Analyze the complexity and inapproximability for the general case of the scaffold filling problem to maximize the increased duo-preservationsIn this thesis,we mainly study the complexity of the general case for OSSF-MIDP.By constructing an L-reduction from a known Max SNP-Complete problem,i.e.,(3DM-2)1 to the OSSF-MIDP problem,we prove that OSSF-MIDP is NP-Hard and even Max SNP-Complete,which means that there is no polynomial approximation scheme for OSSF-MIDP problem.At the same time,we also study the inapproximability of OSSF-MIDP problem,and prove that the OSSF-MIDP problem cannot be approximated within 16263/16262 by using an inapproximability bound of(3DM-2)1.5.Design three approximation algorithms for the general version of OSSF-MIDPIn this thesis,we obtain the upper bound of the optimal solution of OSSF-MIDP problem and design three polynomial time approximation algorithms.By designing a basic algorithm with an approximate ratio of 2,we guarantee that the number of increased duo-preservations is exactly the same as the number of inserted genes.On this basis,the formula expression of the optimal solution is obtained,and an upper bound is obtained.Then,we devise a polynomial approximation algorithm with an approximate ratio of 12/7 using the greedy strategy;devise a polynomial approximation algorithm with an approximate ratio of 14/9 by the local search technique.
Keywords/Search Tags:Genome, Scaffold filling, Approximation algorithm, Inapproximability
PDF Full Text Request
Related items