
Transcriptome detection by multiple RNA tiling array analysis and identifying functional conserved non-coding elements by statistical testing

Posted on: 2009-12-13
Degree: Ph.D.
Type: Thesis
University: University of California, Berkeley
Candidate: Xu, Na
Full Text: PDF
GTID: 2448390002996314
Subject: Biology
Abstract/Summary:
This thesis aims to identify functional genome regions by developing efficient computational methods for analyzing biological data. It is composed of two parts: Part I covers multiple RNA tiling array analysis to detect genome regions that are transcribed, and Part II is a CNE (Conserved Non-coding Element) study to identify the functions of non-coding genome elements conserved over evolution among human, mouse and fugu (a kind of puffer fish).

Part I. Tiling arrays are becoming instrumental for genome-wide identification and characterization of functional elements. Because the activity of functional elements is dynamic, comprehensive detection of such elements usually calls for multiple arrays, e.g. under different experimental conditions or from different cell lines. The huge amount of data collected in multiple tiling array experiments and the complicated dependency structure between arrays pose significant challenges for computational methods. In this thesis, in the context of multiple RNA tiling arrays, we present a random model that elaborates the sources of randomness in multiple tiling array data. With the insights provided by the model, we identify the factors that behave differently between regions with active binding and regions with mainly non-specific binding. Making use of these indicative factors, we propose two probe summary statistics, the MISMC score and the MISPF score, for functional element detection in multiple tiling array studies. By integrating two segmentation strategies, the thresholding method and the HMM method, with our proposed summary scores, we developed our multiple tiling array analysis algorithms. Our algorithms are rapid, easy to implement, and effective in integrating information across arrays and neighboring probes.
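The thresholding strategy mentioned above can be illustrated with a minimal sketch. The abstract does not define the MISMC or MISPF scores, so the per-probe median across arrays below is only a stand-in summary, and the function names and the `min_run` and `max_gap` parameters are hypothetical choices made for illustration, not the thesis's algorithm.

```python
from statistics import median


def summarize_probes(arrays):
    """Per-probe median across arrays.

    This is a stand-in summary score, NOT the MISMC or MISPF score,
    which the abstract does not define. `arrays` is a list of
    equal-length intensity lists, one list per array.
    """
    return [median(values) for values in zip(*arrays)]


def threshold_segments(scores, threshold, min_run=2, max_gap=1):
    """Return (start, end) probe-index pairs, end inclusive.

    A segment is a run of probes scoring above `threshold`. Runs
    separated by at most `max_gap` low-scoring probes are merged, and
    segments spanning fewer than `min_run` probes are dropped.
    Both parameters are assumptions made for this sketch.
    """
    segments = []
    start = None
    last_hit = None
    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = i
            elif i - last_hit - 1 > max_gap:
                # Gap too long: close the current segment, open a new one.
                segments.append((start, last_hit))
                start = i
            last_hit = i
    if start is not None:
        segments.append((start, last_hit))
    return [(a, b) for a, b in segments if b - a + 1 >= min_run]
```

Summarizing across arrays before segmenting is what lets a single pass over the probes use information from every array; the gap tolerance plays the role of borrowing strength from neighboring probes.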
Based on two Affymetrix RNA tiling array data sets (one for human and the other for Drosophila) and currently known genome annotations, our algorithms are shown to have satisfactory detection performance, with improved detection sensitivity and specificity compared to some other methods. We speculate that our model and analysis algorithms, with appropriate modifications, could be extended to other similar microarray scenarios.

Part II. Protein-coding sequences account for only a very small portion of the whole human genome, and many non-coding genome regions are believed to have critical functions in the organism. To identify functional non-coding parts, comparative genomics is widely used, based on the belief that conservation among multiple highly divergent species may result from functional constraints. In our study, we analyzed 2094 Conserved Non-coding Elements (CNEs) that are above 70% identity and over 80 bp long in a whole-genome comparison between human, mouse and fugu with only repeated regions masked. With extremely high statistical significance, we found that the CNEs are located in clusters and around developmental regulator genes and genes related to DNA binding, but tend not to be near genes related to cellular process and catalytic activity. In addition, we examined gene expression data from the GNF (Genomics Institute of the Novartis Research Foundation) database, which contains expression profiles for over 14000 human genes in 79 tissues. We found that CNE neighbor genes are significantly highly expressed in nerve and brain tissues. On closer inspection, we found that CNE neighbor genes associated with binding are more highly expressed in nerve tissues, and CNE neighbor genes associated with transcription regulator activity are more highly expressed in both brain and nerve tissues. Based on these results, we focused on neuron genes for further analysis.
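The abstract does not name the statistical test behind the gene-category findings. As an illustration only, one common way to ask whether an annotation is over-represented among a set of neighbor genes is a hypergeometric upper-tail probability; the sketch below assumes that choice, and the function name and argument names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) under a hypergeometric model.

    `draws` genes are sampled without replacement from `population`
    genes, of which `successes` carry the annotation of interest.
    This is an assumed, commonly used enrichment test; the abstract
    does not say which test the thesis applies.
    """
    lower = max(observed, 0)
    upper = min(draws, successes)
    total = comb(population, draws)
    # math.comb returns 0 when k > n, so impossible terms vanish.
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(lower, upper + 1)
    )
    return tail / total
```

A small value would indicate that the annotation appears among the sampled genes more often than chance sampling from the whole gene set would explain.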
While previous studies could not find conserved CNEs in invertebrates, we found some conserved patterns in C. elegans by aligning upstream sequences of C. elegans neuron genes with the CNEs neighboring their homologs in human and mouse. In particular, we found a 13 bp perfectly conserved pattern upstream of unc-30 in C. elegans and upstream of pitx2 (the unc-30 homolog) in human and mouse. Moreover, this pattern contains a match to a known binding site of Neurod-1, a human transcription factor whose C. elegans homolog, Cnd-1, has been suggested in the literature to regulate unc-30. Thus, these highly conserved non-coding sequences might also play an important role in the development of the central nervous system in both vertebrates and invertebrates.
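The CNE selection criterion described above (above 70% identity over more than 80 bp) can be sketched as a sliding-window scan. This is a simplified illustration, not the thesis's pipeline: it assumes two sequences that are already aligned to equal length with `-` marking gaps, whereas the study uses a whole-genome comparison across three species, and the function name and the inclusive handling of the thresholds are assumptions.

```python
def conserved_regions(seq_a, seq_b, window=80, min_identity=0.70):
    """Find regions where aligned sequences stay highly identical.

    Returns (start, end) pairs in alignment coordinates, end exclusive.
    A window qualifies when its fraction of identical, non-gap columns
    is at least `min_identity`; qualifying windows that overlap or
    touch are merged. The identity of a merged region as a whole is
    not re-checked in this sketch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(seq_a)
    if n < window:
        return []
    # 1 for an identical non-gap column, 0 otherwise.
    match = [
        int(a == b and a != "-")
        for a, b in zip(seq_a.upper(), seq_b.upper())
    ]
    count = sum(match[:window])
    regions = []
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window one column to the right.
            count += match[start + window - 1] - match[start - 1]
        if count / window >= min_identity:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions
```

The running count keeps the scan linear in the alignment length, which matters at genome scale; a stricter variant could re-score each merged region and trim its ends.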
Keywords/Search Tags:RNA tiling array, Multiple RNA tiling, Functional, Conserved non-coding, Identify, CNE neighbor genes, Elements, Detection