Algorithms for characterizing structural variation in human genome

Posted on: 2011-08-02
Degree: Ph.D
Type: Thesis
University: Case Western Reserve University
Candidate: Yavas, Gokhan
GTID: 2443390002464568
Subject: Biology
Abstract/Summary:
Until fairly recently, single nucleotide polymorphisms (SNPs) were thought to be the main source of variation in the human genome. With the advent of high-throughput genome scanning technologies, it has been revealed that there are other forms of genomic variation beyond single base-pair substitutions. These structural alterations include insertions, deletions, inversions, translocations, tandem repeats of DNA sequences and copy number variants (CNVs). Collectively, these alterations are referred to as structural variations.

CNVs are segments of the genome that are polymorphic with regard to genomic copy number. Copy number polymorphisms (CNPs), which can be considered a specific category of CNVs, are defined as copy number variants that are present, with identical boundaries (and are therefore likely identical-by-descent), in at least 1% of the human population. Tandem repeats, on the other hand, are serially repeated segments of the human genome whose repeat units may be several hundred kilobases in size.

CNVs, which have been shown to play a role in various diseases such as Alzheimer's disease, Crohn's disease, autism and schizophrenia, can be caused by structural mutations such as duplications and deletions. In the effort to scan the entire genome of human populations, as well as of individuals, for CNVs (including CNPs) and tandem repeats, SNP arrays and paired-end sequence mapping data have emerged as important tools.

In this thesis, we study the problem of identifying CNVs, CNPs and tandem repeats from these data sources. We first frame CNV identification as an optimization problem with an objective function that is explicitly designed so that its optimal solution is the most accurate set of CNV calls. Our method, termed COKGEN, searches for the best solution using a variant of the well-known simulated annealing heuristic. Next, we present a method for identifying and genotyping common CNPs. The proposed method, POLYGON, draws strength from multiple samples to produce copy number genotypes of the samples at each CNP and to fine-tune its boundaries. Finally, we present a novel graph-theoretic method for determining tandem repeats from paired-end read data obtained by massively parallel paired-end sequencing of the target genome.
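
To make the idea of casting CNV calling as an optimization problem solved by simulated annealing more concrete, the following is a minimal generic sketch in Python. It is not COKGEN and does not reproduce the thesis's objective function: the state means, the switch penalty, the cooling schedule and the function names `cost` and `anneal` are all assumptions chosen purely for illustration.

```python
import math
import random

# Hypothetical expected log-ratio per copy-number state (toy values, not from the thesis):
# -1 = loss, 0 = normal, 1 = gain.
STATE_MEANS = {-1: -0.5, 0: 0.0, 1: 0.4}


def cost(signal, states, switch_penalty=0.3):
    """Toy objective: squared misfit to the state means plus a penalty per state change."""
    fit = sum((x - STATE_MEANS[s]) ** 2 for x, s in zip(signal, states))
    switches = sum(1 for a, b in zip(states, states[1:]) if a != b)
    return fit + switch_penalty * switches


def anneal(signal, steps=20000, t_start=1.0, t_end=0.001, seed=0):
    """Plain simulated annealing over per-probe state labels (illustrative only)."""
    rng = random.Random(seed)
    current = [0] * len(signal)              # start with every probe called "normal"
    current_cost = cost(signal, current)
    best, best_cost = current[:], current_cost
    for step in range(steps):
        # Geometric cooling from t_start down towards t_end.
        t = t_start * (t_end / t_start) ** (step / steps)
        # Propose a move: relabel one randomly chosen probe.
        i = rng.randrange(len(signal))
        old = current[i]
        current[i] = rng.choice([s for s in STATE_MEANS if s != old])
        new_cost = cost(signal, current)
        delta = new_cost - current_cost
        # Always accept improvements; accept worse moves with probability exp(-delta / t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current_cost = new_cost
            if current_cost < best_cost:
                best, best_cost = current[:], current_cost
        else:
            current[i] = old                 # reject: undo the move
    return best, best_cost


# Usage on a made-up intensity profile with an apparent loss and an apparent gain.
toy_signal = [0.0, 0.1, -0.5, -0.6, -0.4, 0.0, 0.05, 0.4, 0.5, 0.3, 0.0, -0.1]
calls, score = anneal(toy_signal)
print(calls, round(score, 3))
```

The switch penalty stands in for the general principle that an objective function can reward contiguous calls over isolated ones; how the thesis actually encodes accuracy in its objective is not stated in this abstract.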
Keywords/Search Tags: Genome, Human, Tandem repeats, Variation, Structural, Method