| Identifying biomarkers associated with cancer initiation and progression from high-throughput protein and gene expression profiles is an important task in cancer research. Although related studies have achieved impressive results, the reproducibility of cancer biomarkers across different studies is very low, raising doubts about their biological significance and clinical applicability. It is therefore important to analyse the factors that affect the reproducibility of cancer biomarkers and to find reproducible biomarkers. In this thesis, we investigated the reproducibility of cancer biomarkers identified using high-throughput protein and gene expression data. 1. The effect of SELDI-TOF (Surface-Enhanced Laser Desorption/Ionization Time-Of-Flight) mass spectrometry (MS) data pre-processing algorithms on the reproducibility of cancer-associated peak markers was analysed. Raw SELDI-TOF MS data consist of mass-to-charge ratios (m/z) and their intensity values. Data pre-processing is needed to detect peaks representing proteins or peptides from the m/z values and to calculate their expression values, producing a peak profile in which cancer-associated peak markers can be sought. However, for the same SELDI-TOF MS dataset, the peaks detected by different data pre-processing algorithms vary greatly. Our results showed that the differences between peak profiles affected the reproducibility of cancer-associated peak markers (here, differentially expressed (DE) peaks) in two respects: some DE peaks are absent from another peak profile, and the statistical power of DE peak identification is low in profiles with a large number of peaks. Therefore, we proposed a 2-means clustering stratification approach to improve the statistical power of identifying DE peaks in large profiles and demonstrated that it also improved the reproducibility of DE peaks identified using different pre-processing algorithms.
Based on these results, we suggest selecting the pre-processing algorithm that produces more peaks and then increasing the statistical power with a more powerful approach, in order to identify more reproducible cancer-associated peak markers. 2. Weak differential gene expression signals and reproducible functions associated with breast cancer metastasis were revealed. Because the expression difference for a gene is usually small between metastatic and non-metastatic primary breast cancer samples, the statistical power of identifying DE genes may be very low at a conventional false discovery rate (FDR) control level (e.g. 5% or 10%). As a result, sufficient DE genes cannot be obtained for the subsequent functional enrichment analysis that extracts function markers associated with breast cancer metastasis. In this study, we analysed five microarray datasets on breast cancer metastasis. For the two datasets with weak differential gene expression signals, we used two approaches to select sufficient DE genes for finding functions associated with breast cancer metastasis. First, a 2-means clustering stratification approach was used to improve the statistical power so that more DE genes could be detected. Second, guided by a robustness analysis of functional enrichment in the other three datasets, a less stringent FDR control level was used to identify more DE genes, from which functions associated with breast cancer metastasis were reliably identified. Next, we proposed a statistical approach to extract reproducible functions associated with breast cancer metastasis from different datasets. Finally, the reproducible metastasis-associated functions detected using the two approaches mentioned above were compared.
The results showed that some biological processes (such as 'cell division', 'cell cycle' and 'DNA replication') may be altered as a whole, rather than only in their sub-processes, during the course of breast cancer metastasis, suggesting that breast cancer metastasis is a 'systems disease' process with global gene expression changes. In summary, we analysed the reproducibility of cancer biomarkers identified in high-throughput SELDI-TOF MS and microarray data and proposed solutions to improve it, which are of significance for cancer research based on these two high-throughput technologies. |
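
The idea of stratifying features before FDR control can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the abstract does not state which quantity the 2-means clustering is applied to, so the per-feature `covariate` below is an assumption, as are all function names. The sketch splits features into two strata with one-dimensional 2-means clustering and then applies the standard Benjamini-Hochberg procedure within each stratum, so that a stratum enriched for small p-values is not diluted by the many null features in the other stratum.

```python
from typing import List, Optional, Sequence


def benjamini_hochberg(pvalues: Sequence[float], alpha: float) -> List[bool]:
    """Return one reject flag per p-value using the BH step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


def two_means_labels(values: Sequence[float], max_iter: int = 100) -> List[int]:
    """1-D k-means with k=2 (Lloyd's algorithm), centres initialised at min and max."""
    c0, c1 = min(values), max(values)
    labels: Optional[List[int]] = None
    for _ in range(max_iter):
        new_labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in values]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        if group0:
            c0 = sum(group0) / len(group0)
        if group1:
            c1 = sum(group1) / len(group1)
    return labels if labels is not None else [0] * len(values)


def stratified_bh(pvalues: Sequence[float],
                  covariate: Sequence[float],
                  alpha: float) -> List[bool]:
    """Apply BH separately inside each of the two strata found by 2-means.

    `covariate` is a hypothetical per-feature quantity used only to form strata.
    """
    labels = two_means_labels(covariate)
    rejected = [False] * len(pvalues)
    for stratum in (0, 1):
        members = [i for i, lab in enumerate(labels) if lab == stratum]
        if not members:
            continue
        flags = benjamini_hochberg([pvalues[i] for i in members], alpha)
        for i, flag in zip(members, flags):
            rejected[i] = flag
    return rejected
```

Calling `stratified_bh(pvalues, covariate, 0.05)` returns a list of booleans aligned with `pvalues`. Whether stratification detects more features than pooled BH depends on how well the chosen covariate separates features with true signal from null features.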