Statistical methods and software for ChIP-Seq data analysis

Posted on:2013-10-09

Degree:Ph.D

Type:Thesis

University:The University of Wisconsin - Madison

Candidate:Chung, Dongjun

Full Text:PDF

GTID:2458390008464929

Subject:Statistics

Abstract/Summary:

Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-Seq) has been successfully used for genome-wide profiling of transcription factor binding sites, histone modifications, and nucleosome occupancy in many model organisms and humans. This thesis focuses on developing statistical methodologies and software to analyze ChIP-Seq data in an unbiased way.

This thesis is composed of three major parts. In the first part, we discuss statistical challenges in the identification of binding events in repetitive regions. State-of-the-art ChIP-Seq analysis relies only on reads that map uniquely to a relevant reference genome (uni-reads). We developed CSEM, a general statistical approach for utilizing reads that map to multiple locations on the reference genome (multi-reads). Our computational and experimental results establish that multi-reads can be of critical importance for studying transcription factor binding in highly repetitive regions of genomes with ChIP-Seq experiments.

In the second part, we investigate statistical challenges in the identification of closely spaced binding events. Because compact prokaryotic genomes harbor binding sites, some of which are separated by only a few base pairs, applications of ChIP-Seq in this domain have not reached their full potential. Although the paired-end tag (PET) assay enables higher-resolution identification of binding events than the single-end tag (SET) assay, standard ChIP-Seq analysis methods are not equipped to utilize PET-specific features of the data. To address this problem, we developed dPeak, a high-resolution binding site identification algorithm that is applicable to both PET and SET data. Our computational and experimental results show that, when coupled with PET data, dPeak can identify closely spaced binding sites with high accuracy.

In the third part, we describe our three novel ChIP-Seq data analysis software packages: csem, mosaics, and dpeak. These packages address three important problems in ChIP-Seq data analysis, respectively: identification of binding events in repetitive regions, consideration of important sequence biases in peak calling, and identification of closely spaced binding events. Through applications to real ChIP-Seq data, we illustrate how these packages can reveal novel biological insights that are currently overlooked in standard ChIP-Seq data analysis.
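The multi-read idea from the first part can be illustrated with a toy sketch. This is not the CSEM model, which the abstract does not specify. It is a minimal illustration, assuming that each multi-read is split fractionally among its candidate locations in proportion to the read coverage currently estimated at those locations. The function name `allocate_multireads`, its parameters, and the location labels are all hypothetical.

```python
from collections import defaultdict


def allocate_multireads(uni_counts, multi_reads, n_iter=20, pseudo=1e-6):
    """Toy fractional allocation of multi-reads (illustrative only, not CSEM).

    uni_counts:  dict mapping a location label to its count of uniquely mapped reads.
    multi_reads: list of reads; each read is a list of candidate location labels.
    Returns one dict per multi-read mapping each candidate location to a weight;
    the weights of a single read sum to 1.
    """
    # Start by splitting every multi-read equally among its distinct candidates.
    weights = []
    for candidates in multi_reads:
        unique = sorted(set(candidates))
        weights.append({loc: 1.0 / len(unique) for loc in unique})

    for _ in range(n_iter):
        # Estimate coverage from uni-reads plus the current fractional multi-reads.
        coverage = defaultdict(float)
        for loc, count in uni_counts.items():
            coverage[loc] += count
        for read_weights in weights:
            for loc, frac in read_weights.items():
                coverage[loc] += frac

        # Re-split each multi-read in proportion to the estimated coverage.
        # The small pseudo-count avoids division by zero.
        updated = []
        for read_weights in weights:
            scores = {loc: coverage[loc] + pseudo for loc in read_weights}
            total = sum(scores.values())
            updated.append({loc: s / total for loc, s in scores.items()})
        weights = updated

    return weights


if __name__ == "__main__":
    # One multi-read that maps to both "A" and "B"; only "A" has uni-read support.
    result = allocate_multireads({"A": 10, "B": 0}, [["A", "B"]])
    print(result)
```

In this example the multi-read ends up assigned mostly to location "A", because only "A" is supported by uniquely mapped reads. When the candidate locations have no distinguishing evidence, the split stays equal.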

Keywords/Search Tags:

ChIP-Seq, Software, Binding, Statistical

Related items

1	Genomic investigation of SREBP family transcription factors using ChIP-chip and ChIP-seq
2	The Statistical Software Platform With A Certain Extent Of Intelligence
3	Research Of Process-level Redundant Static Binding And Dynamic Binding Mechanism Based On Domestic Multicore Processor
4	Research On UML-Based Statistical Software Testing
5	Research And Application Of Method In Domain Analysis Based On Feature Binding Unit
6	Harmonic analysis of ChIP-seq
7	Visual Cognitive Mechanism Of Color And Shape Features Binding And Study Of Computer Modeling Methods
8	The Study Of Characterization And Prediction Of Binding Sites On Proteins Based On Machine Learning Methods
9	Statistical problems in DNA microarray data analysis
10	Research On Binding Table's Security Problems Of IPv6 Source Address Validation In SDN