Algorithms for large scale DNA copy number data

Posted on:2013-12-01

Degree:Ph.D

Type:Dissertation

University:Southern Methodist University

Candidate:Wang, Siling

GTID:1454390008468700

Subject:Biology

Abstract/Summary:

High-throughput array-based assays have recently been developed to detect DNA copy number (DCN) aberrations. Identifying DCN aberrations is important for finding tumor suppressor genes and oncogenes. However, the DCN data from these arrays are characterized by high noise levels and unequal spacing of the probes along the genome.

Several types of methods have been proposed for analyzing DCN data. Denoising and smoothing approaches try to reduce the noise in the data. Segmentation approaches try to identify the chromosomal segments with copy number aberrations.

A novel stationary wavelet denoising scheme based on interpolation is developed for DCN data. Empirical results on synthetic data show that our method outperforms the best previously proposed wavelet denoising method by 4.6% to 12.7% as measured by root mean squared error. Experiments on a real data set also confirm the applicability of our method to real DCN data.

A novel model-based method using the minimum description length (MDL) principle is then developed for DCN data segmentation. The goal of computational analysis of array-based DCN data is to infer the underlying DCNs from the raw data. However, the tumor sample is often contaminated by normal cells, and previous methods for this task do not model the tumor/normal cell mixture ratio explicitly. Our new method outputs the underlying DCN for each chromosomal segment and, at the same time, infers the underlying tumor proportion in the test samples. Empirical results show that our method achieves a 40% to 60% decrease in misclassification rate on average compared with two previous methods, Circular Binary Segmentation (CBS) and the Hidden Markov Model (HMM).

The HMM is a good model for parsing noisy data with hidden states. It has already been successful in speech recognition and shape identification, so it has strong potential to be effective on DCN data with a high noise level.
We propose a Gaussian mixture hidden Markov model (GMHMM) method that divides noisy DCN data into three states: loss, normal, and gain. Our GMHMM is shown to achieve a higher classification accuracy than CBS, a previous HMM method, and Ultrasome on both synthetic and real data.

After this preliminary analysis of the DCN data, the next step is data mining, for example classifying cancers according to features of the data, or applying the Gene Ontology to interpret the data. For cancer classification on DCN data, we introduce a method based on optimization of interval thresholds. The underlying function in array comparative genomic hybridization (aCGH) DNA copy number data is a piecewise-constant square-wave function, so a smaller number of interval features can represent the data instead of all the probes. In this way, we avoid the "curse of dimensionality" problem. Using intervals as features yields better classification accuracies and P-values than using probes as features.

The Gene Ontology is a controlled vocabulary created to describe gene functions. Many web tools, such as EASE and GoMiner, use Fisher's exact test to find the biological interpretation of an interesting gene list in the context of the Gene Ontology. They require the user to select a list of significantly dysregulated genes from the whole list of genes on a microarray. This gene-selection step can be difficult because P-value estimates may be inaccurate after multiple testing correction. We develop a novel method that applies t-tests to the whole gene set on a microarray, ranks the genes by P-value, and then combines the P-values, eliminating the need for a gene-selection step. We obtain better results than EASE, as measured by comparing receiver operating characteristic curves.
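
The effect of normal-cell contamination described in the abstract can be illustrated with a small sketch. It assumes the common two-population mixing model in which the contaminating normal cells are diploid. This model, the function names `expected_log2_ratio` and `tumor_copy_number`, and the example values are assumptions made for illustration and are not taken from the dissertation's MDL method.

```python
import math

NORMAL_COPIES = 2  # assumption: contaminating normal cells carry two copies


def expected_log2_ratio(tumor_copies: float, tumor_fraction: float) -> float:
    """Expected log2 ratio of a segment under a simple tumor/normal mixture."""
    mixed = tumor_fraction * tumor_copies + (1.0 - tumor_fraction) * NORMAL_COPIES
    if mixed <= 0:
        # Homozygous deletion in a pure tumor sample: the ratio is zero.
        return float("-inf")
    return math.log2(mixed / NORMAL_COPIES)


def tumor_copy_number(log2_ratio: float, tumor_fraction: float) -> float:
    """Invert the mixture model to recover the copy number in the tumor cells."""
    mixed = NORMAL_COPIES * 2.0 ** log2_ratio
    return (mixed - (1.0 - tumor_fraction) * NORMAL_COPIES) / tumor_fraction


if __name__ == "__main__":
    # A single-copy loss looks weaker as the tumor fraction drops.
    for fraction in (1.0, 0.7, 0.4):
        print(fraction, round(expected_log2_ratio(1, fraction), 3))
```

Under this assumed model, the same aberration produces a smaller shift in the measured log2 ratio when the tumor proportion is lower, which is why a method that ignores the mixture ratio can misclassify segments.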
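
To make the idea of labeling probes as loss, normal, or gain with an HMM more concrete, the following is a minimal sketch of Viterbi decoding over three hidden states. It is a simplification: each state emits from a single Gaussian rather than a Gaussian mixture, and the emission parameters, the `STAY_PROB` value, and the function names are hypothetical choices, not the trained GMHMM from the dissertation.

```python
import math

STATES = ("loss", "normal", "gain")
# Hypothetical emission parameters: (mean, standard deviation) of the log2 ratio.
EMISSION = {"loss": (-0.5, 0.2), "normal": (0.0, 0.2), "gain": (0.4, 0.2)}
STAY_PROB = 0.98  # assumed probability of remaining in the same state


def log_gaussian(x, mean, std):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - (x - mean) ** 2 / (2.0 * std * std)


def log_transition(prev, cur):
    """Log transition probability; leaving a state is spread evenly over the others."""
    if prev == cur:
        return math.log(STAY_PROB)
    return math.log((1.0 - STAY_PROB) / (len(STATES) - 1))


def viterbi(log_ratios):
    """Return the most likely state sequence for a list of log2 ratios."""
    if not log_ratios:
        return []
    # Uniform initial distribution over the three states.
    score = {
        s: -math.log(len(STATES)) + log_gaussian(log_ratios[0], *EMISSION[s])
        for s in STATES
    }
    backpointers = []
    for x in log_ratios[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_transition(p, cur))
            new_score[cur] = (
                score[best_prev]
                + log_transition(best_prev, cur)
                + log_gaussian(x, *EMISSION[cur])
            )
            pointers[cur] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    data = [0.0, 0.05, -0.03, 0.02, -0.5, -0.55, -0.45, -0.5, 0.01, -0.02, 0.04, 0.0]
    print(viterbi(data))
```

The high self-transition probability is what discourages the decoder from switching state on a single noisy probe; in practice these parameters would be estimated from data rather than fixed by hand.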
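
The reduction from probes to interval features can be sketched as follows. Because the underlying signal is piecewise constant, each interval is summarized here by the mean of its probes. The interval boundaries, the function name `interval_features`, and the use of the mean are assumptions for illustration; the abstract does not specify how the dissertation chooses or optimizes the intervals.

```python
from statistics import mean


def interval_features(log_ratios, boundaries):
    """Summarize probes by the mean log2 ratio within each [start, end) index interval."""
    return [mean(log_ratios[start:end]) for start, end in boundaries]


if __name__ == "__main__":
    # Eight probes reduced to two features (hypothetical boundaries).
    probes = [0.0, 0.1, -0.1, 0.0, 0.5, 0.4, 0.6, 0.5]
    print(interval_features(probes, [(0, 4), (4, 8)]))
```

A classifier trained on such interval features sees far fewer dimensions than one trained on every probe, which is the motivation the abstract gives for avoiding the curse of dimensionality.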
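
For context on the gene-list approach that tools such as EASE and GoMiner are described as using, a one-sided Fisher's exact test for enrichment of one Gene Ontology term can be sketched as below. This illustrates the baseline test only, not the dissertation's P-value combination method, and the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(total_genes, genes_in_term, selected_genes, selected_in_term):
    """One-sided Fisher's exact test: probability of seeing at least
    `selected_in_term` annotated genes in a random selection of `selected_genes`."""
    max_overlap = min(genes_in_term, selected_genes)
    denominator = comb(total_genes, selected_genes)
    # Sum the hypergeometric probabilities over the upper tail.
    tail = sum(
        comb(genes_in_term, k) * comb(total_genes - genes_in_term, selected_genes - k)
        for k in range(selected_in_term, max_overlap + 1)
    )
    return tail / denominator


if __name__ == "__main__":
    # Hypothetical counts: 10 genes on the array, 3 annotated with the term,
    # 3 selected as significant, all 3 of them annotated.
    print(enrichment_p_value(10, 3, 3, 3))
```

The test depends on which genes are placed in the selected list, so its result changes with the significance cutoff; this is the sensitivity that the abstract's selection-free approach is meant to avoid.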

Keywords/Search Tags:

DNA copy number, Data, DCN, Method, Developed

Related items

1	Research On Cancer Copy Number Variation Detection Methods For Next-Generation Sequencing Data
2	An Integrated Bioinformatics Study Of DNA Copy Number Variation And Differentially Expressed Gene
3	Genome-wide Copy Number Variation Polymorphism In Yunnan Normal Population And Its Clinical Application
4	Clinical And Experimental Study Of FCGR3B Copy Number Variations In Renal Diseases
5	Clinical Application Of Non-invasive Prenatal Testing Technology For Fetal Chromosome Copy Number Variation Detection
6	Accurate Inference Of Tumor Purity And Absolute Copy Number From High-throughput Sequencing Data
7	The Genomic Copy Number Variation In Adult Acute Lymphoblastic Leukemia
8	A Template Block QPCR For Rapid Detection Of Copy Number Variations
9	A New Method For Evaluate The Fetal Conus Medullaris Position By Ultrasound And Explore Copy Number Variants In Tethered Cord Fetuses
10	Integrative Analysis Of "OMICS" Data Of Hepatocellular Carcinoma And De Novo Germline Copy Number Variants In Hepatocellular Carcinoma