Statistical methods and software for ChIP-Seq data analysis

Posted on: 2013-10-09
Degree: Ph.D.
Type: Thesis
University: The University of Wisconsin - Madison
Candidate: Chung, Dongjun
GTID: 2458390008464929
Subject: Statistics
Abstract/Summary:
Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-Seq) has been successfully used for genome-wide profiling of transcription factor binding sites, histone modifications, and nucleosome occupancy in many model organisms and in humans. This thesis focuses on developing statistical methodologies and software to analyze ChIP-Seq data in an unbiased way.

This thesis is composed of three major parts. In the first part, we discuss statistical challenges in identifying binding events in repetitive regions. The state of the art for analyzing ChIP-Seq data uses only reads that map uniquely to a relevant reference genome (uni-reads). We developed CSEM, a general statistical approach for utilizing reads that map to multiple locations on the reference genome (multi-reads). Our computational and experimental results establish that multi-reads can be of critical importance for studying transcription factor binding in highly repetitive regions of genomes with ChIP-Seq experiments.

In the second part, we investigate statistical challenges in identifying closely spaced binding events. Because compact prokaryotic genomes harbor binding sites, some of which are separated by only a few base pairs, applications of ChIP-Seq in this domain have not reached their full potential. Although the paired-end tag (PET) assay enables higher-resolution identification of binding events than the single-end tag (SET) assay, standard ChIP-Seq analysis methods are not equipped to utilize PET-specific features of the data. To address this problem, we developed dPeak, a high-resolution binding site identification algorithm that is applicable to both PET and SET data. Our computational and experimental results show that, when coupled with PET data, dPeak can identify closely spaced binding sites with high accuracy.

In the third part, we describe our three novel software packages for ChIP-Seq data analysis: csem, mosaics, and dpeak. These packages address three important problems in ChIP-Seq data analysis, respectively: identification of binding events in repetitive regions, consideration of important sequence biases in peak calling, and identification of closely spaced binding events. Through applications to real ChIP-Seq data, we illustrate how these packages can reveal novel biological insights that standard ChIP-Seq data analysis currently overlooks.
Keywords/Search Tags: ChIP-Seq, Software, Binding, Statistical