Algorithms for large scale DNA copy number data

Posted on: 2013-12-01
Degree: Ph.D
Type: Dissertation
University: Southern Methodist University
Candidate: Wang, Siling
GTID: 1454390008468700
Subject: Biology
Abstract/Summary:
High-throughput array-based assays have recently been developed to detect DNA copy number (DCN) aberrations. Identifying DCN aberrations is important for finding tumor suppressor genes and oncogenes. However, the DCN data from these arrays are characterized by high levels of noise and unequal spacing of the probes along the genome.

Several types of methods have been suggested for analysing DCN data. One type is denoising and smoothing approaches, which try to reduce the noise in the data. The other is segmentation approaches, which try to identify the chromosomal segments with copy number aberrations.

A novel stationary wavelet denoising scheme based on interpolation is then developed for DCN data. Empirical results on synthetic data showed that our method outperformed the best previously proposed wavelet denoising method by 4.6% to 12.7% as measured by root mean squared error. Experiments on a real data set also confirmed the applicability of our method to real DCN data.

After that, a novel model-based method using the minimum description length (MDL) principle is developed for DCN data segmentation. Tumor samples are often contaminated by normal cells. The goal of computational analysis of array-based DCN data is to infer the underlying DCNs from raw DCN data, but previous methods for this task do not model the tumor/normal cell mixture ratio explicitly. Our new method outputs the underlying DCN for each chromosomal segment and, at the same time, infers the underlying tumor proportion in the test samples. Empirical results show that our method achieves a 40% to 60% decrease in misclassification rate on average compared with two previous methods, namely Circular Binary Segmentation (CBS) and the Hidden Markov Model (HMM).

The HMM is a good model for parsing noisy data with hidden states. It has already been very successful in speech recognition and shape identification, so it has strong potential for processing DCN data with a high noise level.
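
As a rough illustration of how an HMM can parse noisy data with hidden states, the sketch below runs Viterbi decoding over a sequence of log2 ratios with three states (loss, normal, gain). It is not the dissertation's model: it uses a single Gaussian per state rather than a Gaussian mixture, and the state means, noise level, and transition probabilities are hypothetical values picked for the example.

```python
import math

# Hypothetical parameters for illustration only (not taken from the dissertation).
STATES = ["loss", "normal", "gain"]
MEANS = {"loss": -0.5, "normal": 0.0, "gain": 0.5}  # assumed mean log2 ratio per state
SIGMA = 0.2   # assumed noise standard deviation, shared by all states
STAY = 0.98   # assumed probability of remaining in the same state


def log_gauss(x, mean, sigma):
    """Log density of a normal distribution evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mean) ** 2 / (2 * sigma ** 2)


def log_trans(prev, cur):
    """Log transition probability: STAY on the diagonal, the rest split evenly."""
    if prev == cur:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))


def viterbi(log_ratios):
    """Return the most likely state sequence for a list of log2 ratios."""
    if not log_ratios:
        return []
    # Uniform initial distribution over the three states.
    score = {s: math.log(1 / len(STATES)) + log_gauss(log_ratios[0], MEANS[s], SIGMA)
             for s in STATES}
    back = []  # back[t][cur] = best predecessor of state cur at step t + 1
    for x in log_ratios[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_trans(p, cur))
            pointers[cur] = best_prev
            new_score[cur] = (score[best_prev] + log_trans(best_prev, cur)
                              + log_gauss(x, MEANS[cur], SIGMA))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Calling `viterbi` on a list of probe-level log2 ratios returns one state label per probe. The high `STAY` value is what discourages the decoder from switching state on a single noisy probe.
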
We proposed a Gaussian mixture hidden Markov model (GMHMM) method to divide noisy DCN data into three states: loss, normal, and gain. Our GMHMM is shown to be more accurate in classification rate than CBS, the previous HMM, and ultrasome on both synthetic and real data.

After preliminary analysis of the DCN data, the next step is data mining: for example, cancer classification according to features of the data, or applying the Gene Ontology to the data to retrieve its biological meaning. To perform cancer classification on DCN data, we introduced a method based on optimization of interval thresholds. The underlying function in aCGH DNA copy number data is a piecewise-constant square-wave function, so instead of using all the probes, a smaller number of features can represent the DNA copy number data. In this way, we avoided the "curse of dimensionality" problem. Using intervals as features yields better classification accuracies and P-values than using probes as features.

Gene Ontology is a controlled vocabulary created to describe the functions of genes. Many web tools, such as EASE and GoMiner, use Fisher's exact test to find the biological interpretation of an interesting gene list in the context of the Gene Ontology. They require the user to select a list of significantly dysregulated genes from the whole list of genes on a microarray. This gene selection step can be difficult because P-value estimation after multiple testing correction is potentially inaccurate. Our novel method applies t-tests to the whole gene set on a microarray, ranks the genes by P-value, and then combines the P-values, eliminating the need for a gene-selection step. We obtained better results than with EASE, as measured by comparing receiver operating characteristic curves.
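
The abstract does not say how the P-values are combined. As a stand-in only, the sketch below uses Fisher's method, a classical way to combine independent P-values, to show how per-gene P-values for one gene set could be summarized without first selecting a gene list. The function name `fisher_combine` is hypothetical, and independence of the P-values is an assumption of Fisher's method.

```python
import math


def fisher_combine(p_values):
    """Combine independent P-values with Fisher's method (a classical stand-in,
    not the dissertation's own rule).

    Returns (statistic, combined_p). Under the null hypothesis,
    statistic = -2 * sum(ln p) follows a chi-square distribution with 2k
    degrees of freedom, where k is the number of P-values.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one P-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("P-values must lie in (0, 1]")
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    # For even degrees of freedom 2k, the chi-square survival function has the
    # closed form exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total
```

For example, `fisher_combine(pvals)` could be called on the t-test P-values of all genes annotated to one Gene Ontology term, giving a single P-value per term.
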
Keywords/Search Tags: DNA copy number, Data, DCN, Method, Developed