Studies On Algorithms And Complexity For Exemplar Breakpoint Distance Problem

Posted on:2016-08-11

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Z X Wei

Full Text:PDF

GTID:1220330461485530

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

With the development of biological techniques and the progress of computer science and technology, bioinformatics which is an interdisciplinary subject of biology and computer has attracted more and more attention. Bioinformatics can be understood as the applications of computers in biology. Comparative genomics is an important research area in bioinformatics. Since the late 1990s, comparative genomics began to gradually rise, it has been through the course of the past 20 years. The traditional research assumption of comparative genomics is two genomes do not contain duplicate genes. While this assumption is justified for many small virus or mitochondrial genomes, in most cases, however, they are not satisfied, such as mammals and plants etc. For example, it is estimated that 15% of genes are duplicated in the human genome .Research on exemplar breakpoint distance problem is under the assumption that genomes contain duplicate genes.Finding conserved sets of genes (or sequences) between two genomes is one of the fundamental problems in comparison of genomes. A set of genes that is conserved in the same order in genomes suggests that they participate in the same biological process. In order to find conserved sets of genes in genomes, many similarity measures have been proposed, e.g., the number of breakpoints, the number of adjacencies, the number of conserved intervals and the number of common intervals. All these measures are well defined when genomes do not contain gene duplicates. However, gene duplicate is one of the primary driving forces in the evolution of genomes and occurs frequently (Dennis et al.,2012) .When duplicates occur, a gene may appear in several places of a genome.The exemplar breakpoint distance problem is motivated by finding conserved sets of genes between two genomes. It asks to find respective exemplars in two genomes to minimize the breakpoint distance between them. If one genome has no repeated gene (called trivial genome) and the other has genes repeating at most twice, it is referred to as the (1,2)-exemplar breakpoint distance problem, or EBD(1,2) for short.Although EBD(1,2), as the most fundamental exemplar breakpoint distance problem, has been known APX-Hard for several years , little is done in terms of algorithm design. In fact, whether EBD(1,2) admits a fixed-parameter algorithm remains open until now. Zhu (2009) showed that EBD(2,2) rejects any FPT-algorithm if use the exemplar distance as the parameter. It makes whether EBD(1,2) admits a fixed-parameter algorithm based on some parameter more interesting.For the fixed-parameter algorithm of EBD(1,2), the research of this paper is performed from the following aspects:(1) In this paper, we design the first fixed-parameter algorithm for EBD(1,2) using span of the nontrivial genome, say s, as a parameter, where the span of a genome is introduced to represent the maximum number of genes which are spanned by two genes of the same gene family in that genome. This depends on a practical observation that two genes of the same gene family are usually separated by a small number of genes in a genome (Shi et al.,2010). Our algorithm runs in O(4sn2) time and O(4sn) space for an EBD(1,2) instance with n gene families in each of its genomes. It has not been seen before.(2) We have implemented the above algorithm in C++. Simulations on randomly generated data show that our algorithm can always handle a genome of 5000 gene families within 1000 seconds when the span of the genome is 10, and can handle a genome of 1000 gene families within 1600 seconds when the span of the genome is 15. The software package can be accessed from the following website: http://www.cs.sdu.edu.cn/dmzhu/exemplar-distance/software.htmlDe novo peptide sequencing based on tandem mass spectrometry is the early research problem of the author. De novo peptide sequencing based on tandem mass spectrometry is an important method for protein identification in proteomics. It can extract the amino acid sequence of an unknown peptide P without the help of auxiliary protein database. De novo peptide sequencing generally transforms the tandem mass spectrometry of the unknown peptide P into a spectrum graph G, and by searching some proper paths in graph G to identify the amino acid sequence of the unknown peptide P. A path in graph G may represent a possible peptide, we want to find one or more peptides with the maximum similarity with P.In the process of constructing spectrum graph G, because we do not know every peak p in the tandem mass spectrum is the prefix or the suffix of an unknown peptide P, we use two vertices v and v to represent every peak p in the spectrum graph G. vertices v represents the peak p is the prefix, vertices v represents the peak p is the suffix, but by minusing the suffix subsequence from the unknown peptide P, we get the prefix. In this way, for each peak p, no matter whether it is the prefix or the suffix, we can always use a vertex to represent that the peak p corresponding to the prefix of the unknown peptide P. Because each vertex in the graph represents a prefix of an unknown peptide P, if the difference of the corresponding mass values of any two vertices equals to the mass value of one or several amino acids, it means these two prefixes differ by one or several amino acids.Dancik et al. [39] design a weighted algorithm for spectrum graph G, and by searching a longest path in the spectrum graph to find a peptide with the maximum similarity with the unknown peptide P. But they noticed that each peak p generated two vertices v and v in the spectrum graph, the longest path passing v and v simultaneously is meaningless. Because one peak cannot be both the prefix and the suffix. Then the problem become to find a longest path which cannot pass some vertices simultaneously. If we call the two vertices which cannot be passed simultaneously as forbidden pairs, because paths avoiding forbidden pairs problem is NP-Hard,so longest path avoiding forbidden pairs problem is also NP-Hard. Dancik et al.[39] and Chen et al. also noticed that since each vertex has a mass value, actually the vertices are ordered. We can draw each vertices on the number axis, for any two forbidden pairs (v1,v1) and (v2,v2), the interval corresponds to (v1, v1) and the interval corresponds to (v2, v2) are nested.For de novo peptide sequencing based on tandem mass spectrometry, the research of this paper is performed from the following aspects:(1) De novo peptide sequencing can be transformed into longest path avoiding forbidden pairs problem (PAFP). General PAFP is NP-complete. In the process of transformation from de novo peptide sequencing to PAFP, forbidden pairs have nested structure. We give a vertex contraction algorithm which is designed based on nested structure of forbidden pairs to solve de novo peptide sequencing problem. Vertex contraction algorithm is a polynomial time algorithm, and its time complexity is O(|V|3).

Keywords/Search Tags:

Exemplar, Breakpoint, Genome, Span, De novo peptide sequencing, Forbidden pairs, Nested structure, Vertex contraction, Algorithm

PDF Full Text Request

Related items

1	An Improved Algorithm For The Zero Exemplar Breakpoint Distance
2	A Dynamic Programming Algorithm For (1,2)-Exemplar Breakpoint Distance
3	De novo peptide sequencing using a hidden Markov model
4	A New Non-restrictive Enzymatic Proteomics Analytic Strategy With The De Novo Peptide Sequencing
5	The Whole Genome Sequencing And Semi-nested RT-PCR Detection Method Of Transmissible Gastroenteritis Virus HQ2016 Strain Was Established
6	Algorithm Research Of DNA De Novo Assembly Based On Mate-Pair
7	Study On Peptide Sequencing Method Based On Graph Neural Networks
8	Design And Implementation Of De Novo Genome Assembly
9	Detection Of Genome Variants Based On Hight Throughput Sequencing Data
10	De Novo Mutation Detection Method Based On High-Throughput Sequencing Data