
Computational Modeling for Genome Annotation Confidence and DNA Methylation Susceptibility

Posted on: 2011-06-02
Degree: Ph.D
Type: Thesis
University: Indiana University
Candidate: Yang, Young Ik
Full Text: PDF
GTID: 2463390011972733
Subject: Biology
Abstract/Summary:
High-throughput technologies have transformed the biological and medical sciences into data-driven sciences. In particular, new high-throughput sequencing technologies, known as next-generation sequencing technologies, have initiated many research projects that produce huge amounts of data. Scientists face the challenge of finding novel ways to use these data to acquire new biological knowledge. In this thesis, I researched two important bioinformatics problems: the confidence scoring of gene function prediction, and the modeling of DNA methylation susceptibility. Because these data are unprecedented, these two research problems have not been well defined. The key issues discussed in the thesis are: (1) identification of research problems, (2) formulation of research problems, (3) development of computational procedures/algorithms, and (4) evaluation of the algorithms. Thus, the significance of my research contribution lies not only in the development of computational solutions but also in the identification and formulation of two research problems that have not been previously defined.

The first problem considered here is the improvement of the genome annotation process using a novel gene annotation confidence score (ACS) with a modified logistic curve. ACS is a scoring system for gene annotation that works by combining sequence similarity and text similarity. In practice, genome annotation relies on manual verification, in which annotations are examined one by one. ACS provides information on annotation quality, so scientists can focus on the small subset of genes with low ACS, which can significantly reduce the cost of a genome project. The effectiveness of ACS was evaluated on various selections of genomes. The second problem under investigation is the modeling of tissue-specific CpG methylation susceptibility using a k-mer mixture logistic regression function. DNA methylation plays an important role in gene activation in normal and diseased cells; therefore, modeling DNA methylation susceptibility can highlight potential relationships between genomic features and DNA methylation in a tissue-specific context. Extensive experiments were performed using recent sequencing data.
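As a rough illustration of the second problem, the sketch below scores a DNA window by feeding its normalized k-mer frequencies into a logistic function. It is not the thesis's k-mer mixture logistic regression: it uses a single logistic component, and the function names, the choice of k = 2, and the weights and bias are all hypothetical placeholders that would normally be learned from methylation data.

```python
from collections import Counter
from itertools import product
import math


def kmer_counts(seq, k):
    """Count overlapping k-mers in seq, skipping any k-mer with a non-ACGT base."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def feature_vector(seq, k):
    """Return k-mer frequencies in a fixed lexicographic order (length 4**k)."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = kmer_counts(seq, k)
    total = sum(counts.values()) or 1  # avoid division by zero on short input
    return [counts[kmer] / total for kmer in vocab]


def logistic(z):
    """Standard logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def methylation_score(seq, weights, bias, k=2):
    """Hypothetical susceptibility score: logistic of a weighted k-mer profile.

    weights and bias are assumed inputs (one weight per k-mer); a real model
    would estimate them from labeled CpG sites.
    """
    x = feature_vector(seq, k)
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return logistic(z)
```

With all weights and the bias set to zero, `methylation_score` returns 0.5 for any window, which is the uninformative baseline; fitted weights would move the score toward 0 or 1 depending on which k-mers the window contains.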
Keywords/Search Tags: DNA methylation, Annotation, Data, Sequencing, Modeling, Genome, ACS, Confidence