
Statistical and informatics methods for analyzing next generation sequencing data

Posted on: 2018-06-15
Degree: Ph.D
Type: Dissertation
University: Emory University
Candidate: Chen, Li
Full Text: PDF
GTID: 1470390020956824
Subject: Computer Science
Abstract/Summary:
In the era of genomic big data, there is a growing demand for statistical and informatics methods that can analyze such data. The integrative analysis of datasets generated from different sources or under different biological conditions is of particular interest. First, we develop ChIPComp, a statistical method for the quantitative comparison of multiple ChIP-seq datasets from different biological conditions. ChIPComp detects genomic regions showing differential protein binding or histone modification by accounting for data from control experiments, signal-to-noise ratios, biological variation, and multiple-factor experimental designs in a linear model framework. Simulations and real data analyses demonstrate that ChIPComp provides more accurate and robust results than existing methods. Second, drawing on tens of thousands of cataloged trait-associated GWAS SNPs, we present traseR, a computational tool that explores this collection to indicate whether a given genomic interval or set of intervals is likely to be functionally connected with certain phenotypes or diseases. Results on real data indicate that traseR offers a turnkey solution for enrichment analysis of trait-associated SNPs. Third, beyond analyzing datasets from a single source (GWAS or epigenomics), we perform a joint analysis of multiple data sources by annotating GWAS SNPs with thousands of genomic and epigenomic datasets and building DIVAN, a data-driven machine learning approach that aims to identify disease-specific noncoding risk variants on a genome-wide scale. Such an approach helps to clarify the cryptic link between noncoding sequence variants and the pathophysiology of complex diseases and phenotypes. Because it is disease-specific, DIVAN proves more powerful than competing methods in identifying disease-specific noncoding risk variants.
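To make the idea of an enrichment analysis of trait-associated SNPs concrete, the sketch below computes a one-sided binomial p-value for the number of trait-associated SNPs that fall inside a set of query intervals. It is only an illustration under stated assumptions: the abstract does not specify which test traseR uses, the null model here (each SNP lands in the query region with probability equal to the region's share of the genome) is an assumption, and the function name and parameters are hypothetical.

```python
from math import exp, lgamma, log, log1p


def binomial_enrichment_pvalue(hits, total_snps, query_bp, genome_bp):
    """Upper-tail binomial p-value P(X >= hits), X ~ Binomial(total_snps, p).

    Assumed null model (not taken from the dissertation): each of the
    total_snps trait-associated SNPs falls inside the query intervals
    independently with probability p = query_bp / genome_bp.
    """
    if not 0 <= hits <= total_snps:
        raise ValueError("hits must lie between 0 and total_snps")
    if not 0 < query_bp < genome_bp:
        raise ValueError("query_bp must be positive and smaller than genome_bp")

    p = query_bp / genome_bp
    log_p, log_q = log(p), log1p(-p)
    log_n_fact = lgamma(total_snps + 1)

    # Sum the binomial probabilities in log space so that large SNP counts
    # do not overflow the binomial coefficients.
    tail = 0.0
    for i in range(hits, total_snps + 1):
        log_term = (
            log_n_fact
            - lgamma(i + 1)
            - lgamma(total_snps - i + 1)
            + i * log_p
            + (total_snps - i) * log_q
        )
        tail += exp(log_term)
    return min(tail, 1.0)


# Hypothetical numbers: 40 of 20,000 trait-associated SNPs overlap query
# intervals that cover 1 Mb of a 3,000 Mb genome.
p_value = binomial_enrichment_pvalue(40, 20_000, 1_000_000, 3_000_000_000)
```

A small p-value here would suggest that the query intervals contain more trait-associated SNPs than expected from their length alone; a real analysis would also need to account for uneven SNP density and linkage disequilibrium, which this sketch ignores.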
Keywords/Search Tags:Methods, Statistical, Data, Genomic