The Normalization Of High-throughput Omics Data For Cancer

Posted on: 2012-02-23, Degree: Doctor, Type: Dissertation
Country: China, Candidate: D Wang, Full Text: PDF
GTID: 1224330368998522, Subject: Biomedical engineering
Abstract/Summary:
High-throughput technologies for genome-wide profiling of expression, methylation and copy number have revealed a large number of genes with abnormal molecular changes in cancer, which gives us the opportunity to explore cancer mechanisms from a more global perspective. When analyzing such high-throughput omics data, researchers usually normalize microarray data to remove technical variation, on the assumption that only a few genes are differentially expressed in a disease and that the up- and down-regulated changes are balanced. However, accumulating evidence suggests that gene expression may be widely altered in cancer, so the sensitivity of biological discoveries to violations of the normalization assumption needs to be evaluated. Hence, using cancer high-throughput datasets collected from databases in an unbiased manner, we analyzed the systemic changes in these molecular features between cancer and normal samples and evaluated the effects of several widely applied normalization methods on the distribution of true signals.

First, we demonstrated extensive differential expression in cancer. The irreproducibility problem in high-throughput arrays actually reflects this extensive differential expression. We characterized the expression pattern between cancer and normal samples by the direction of deregulation (up-regulation or down-regulation). The results showed that differentially expressed genes (DEGs) change coherently across different studies of a particular cancer, indicating that gene expression is widely altered in a specific up- or down-regulation pattern in that cancer. Second, we analyzed the rationality of normalizing cancer-associated high-throughput data. The results showed that, at least for cancer studies, normalizing all arrays to have the same distribution of probe intensities regardless of the biological groups of the samples might be misleading.
Because the gene expression profiles of cancers contain widely up-regulated genes, most traditional normalization methods may produce a large number of false down-regulated DEGs while leaving many up-regulated DEGs undetected. Meanwhile, we found that the deregulation directions of DEGs are highly reproducible across different studies of the same cancer, suggesting that effective biological signals naturally exist in these data. Hence, it is possible to develop methods for finding these signals. Similarly, we analyzed methylation and copy number profiles. The medians of the methylation signals in normal samples were not significantly different from those in cancer samples. The differentially methylated (DM) genes that were selected only after normalization in one dataset for a cancer had significantly consistent methylation states in another independent dataset for the same cancer, indicating that these extra DM genes are effective biological signals. Normalization methods can therefore be applied to methylation data, provided that false DM genes, whose methylation states in the normalized data are opposite to those in the non-normalized data, are filtered out. Gene copy number profiles, which showed widespread DNA copy number gain, gave results similar to those for gene expression, indicating that searching for cancer-associated biological signals in non-normalized data might be a reasonable strategy. Finally, we analyzed a data preprocessing problem in cDNA microarray data: the correlation of measurements from multiple clones mapped to the same UniGene cluster. The results showed that the correlation improves greatly when updated clone annotations are used. Encouragingly, a large fraction of the inconsistent data is filtered out in the procedure of selecting DEGs.
Although various technical variations exist in cDNA microarrays, applications based on DEG selection can still reach correct biological results, especially at the level of functional modules.

In this work, we analyzed the systemic changes in cancer-associated gene expression, methylation and copy number, and demonstrated that most current normalization methods can distort the signal differences between the normal and cancer states. Our results clearly showed that effective biological signals naturally exist in these high-throughput omics data, so it is possible to develop methods for finding them. Hence, our results provide a reasonable strategy for analyzing high-throughput omics data.
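
The concern about forcing all arrays onto one intensity distribution can be illustrated with a small simulation. The sketch below is not the dissertation's analysis: it uses quantile normalization as one representative of such methods, and the data are synthetic. The function names, the 40% fraction of up-regulated genes, the shift of 1 log unit and the noise levels are all assumptions chosen for illustration.

```python
import numpy as np

def quantile_normalize(x):
    """Give every column (array) of x the same intensity distribution."""
    # Rank of each value within its own column (0 = smallest).
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)
    # Reference distribution: mean of the sorted columns, quantile by quantile.
    mean_quantiles = np.sort(x, axis=0).mean(axis=1)
    return mean_quantiles[ranks]

def simulate(n_genes=2000, n_per_group=10, frac_up=0.4, shift=1.0, seed=0):
    """Synthetic log-intensities: normal arrays first, then cancer arrays."""
    rng = np.random.default_rng(seed)
    base = rng.normal(8.0, 1.0, size=(n_genes, 1))               # gene baseline
    noise = rng.normal(0.0, 0.1, size=(n_genes, 2 * n_per_group))
    data = base + noise
    n_up = int(frac_up * n_genes)
    data[:n_up, n_per_group:] += shift    # hypothetical unbalanced up-regulation
    return data, n_up

def group_difference(data, n_per_group):
    """Mean cancer log-intensity minus mean normal log-intensity, per gene."""
    return data[:, n_per_group:].mean(axis=1) - data[:, :n_per_group].mean(axis=1)

n_per_group = 10
data, n_up = simulate(n_per_group=n_per_group)
normalized = quantile_normalize(data)

raw_diff = group_difference(data, n_per_group)
norm_diff = group_difference(normalized, n_per_group)

unchanged = slice(n_up, None)             # genes with no true change
raw_unchanged_median = np.median(raw_diff[unchanged])
norm_unchanged_median = np.median(norm_diff[unchanged])
frac_pushed_down = np.mean(norm_diff[unchanged] < 0)

print("unchanged genes, median difference before normalization:", raw_unchanged_median)
print("unchanged genes, median difference after normalization: ", norm_unchanged_median)
print("fraction of unchanged genes shifted downward:           ", frac_pushed_down)
print("truly up genes, median difference before/after:         ",
      np.median(raw_diff[:n_up]), np.median(norm_diff[:n_up]))
```

In this synthetic setting the unchanged genes centre on zero before normalization and acquire a negative cancer-versus-normal difference afterwards, while the truly up-regulated genes show a reduced difference. This matches the direction of the distortion described above: spurious down-regulation and weakened up-regulation. The size of the effect depends entirely on the assumed parameters.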
Keywords/Search Tags: Cancer, High-throughput technology, Gene expression, Methylation, Copy number, Data normalization