Statistical methods for the analysis of genomic data from tiling arrays and next generation sequencing technologies

Posted on: 2010-12-11
Degree: Ph.D
Type: Thesis
University: The University of Wisconsin - Madison
Candidate: Kuan, Pei Fen
Full Text: PDF
GTID: 2440390002476815
Subject: Statistics
Abstract/Summary:
Advancements in hybridization and sequencing technologies have generated massive amounts of ChIP-Chip and ChIP-Seq data to elucidate the locations of DNA-protein interactions. This thesis focuses on developing statistical methodologies to facilitate the analysis, integration, and interpretation of these data, and is composed of three major parts.

The first part addresses statistical challenges in analyzing nucleosome occupancy and modification data. Having a reliable map of nucleosome positions is important in studying relationships between positioning and dynamic changes in nucleosome occupancy and gene regulation. However, the highly heterogeneous nature of nucleosome densities across genomes poses challenges in mapping nucleosome positions. We propose a non-homogeneous hidden-state model (NHSM), based on first-order (lagged) differences of experimental data along genomic coordinates, that can automatically detect nucleosome positions of various occupancy levels. Based on the NHSM annotation, we develop a systematic framework for characterizing different types of chromatin remodeling patterns. We also evaluate some normalization issues in tiling arrays and illustrate the main pitfall of MA normalization in correcting for dye bias.

In the second part, we investigate the correlation structure in ChIP-Chip data that arises from tiling array designs. We illustrate the pitfalls of ignoring the correlation structure and the limitation of the current moving average approach, which assumes exchangeability of the measurements within an array. We then develop a robust and rapid method, called CMARRT, that incorporates the correlation structure when analyzing ChIP-Chip data.

The final part develops a new model-based approach for detecting enriched regions in both one-sample and two-sample ChIP-Seq data. Our approach is based on a hierarchical mixture model, which gives rise to a zero-inflated negative binomial (ZINB) distribution in the one-sample problem, coupled with a hidden semi-Markov model (HSMM). It addresses sequencing depth and biases, accounts for the inherent spatial structure of the data, and allows for detection of multiple non-overlapping peaks of variable size. We also propose a new meta false discovery rate (FDR) control at the peak level, which is more desirable than the usual heuristic post-processing of enriched bins identified via bin-level FDR control.
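For readers unfamiliar with the zero-inflated negative binomial named above, the following is a minimal sketch of the generic ZINB probability mass function only. It is not the thesis's hierarchical mixture or its HSMM, and the names `pi0` (extra-zero probability), `mean`, and `size` (dispersion, with variance = mean + mean^2/size) are assumptions chosen for illustration.

```python
import math


def nb_log_pmf(k, mean, size):
    """Log pmf of a negative binomial count k, parameterized by its mean
    and a dispersion `size` (variance = mean + mean**2 / size).
    Parameter names are illustrative assumptions, not from the thesis."""
    p = size / (size + mean)  # "success" probability in the standard form
    return (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(p) + k * math.log1p(-p)
    )


def zinb_pmf(k, pi0, mean, size):
    """Generic zero-inflated negative binomial pmf: with probability pi0
    the count is a structural zero, otherwise it is drawn from the NB."""
    nb = math.exp(nb_log_pmf(k, mean, size))
    if k == 0:
        return pi0 + (1.0 - pi0) * nb
    return (1.0 - pi0) * nb
```

In a bin-count setting, `k` would be the number of reads in a genomic bin; the zero-inflation term absorbs bins with no reads beyond what the negative binomial alone would predict.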
Keywords/Search Tags: Data, Sequencing, Tiling, Statistical