| High-throughput technology has greatly advanced research on complex diseases such as cancer. Mining high-throughput data for disease biomarkers has great value for prognosis and drug target design. However, biomarkers identified for the same disease in different studies are often inconsistent, which raises doubts about their biological significance and clinical implications, and about the reliability of high-throughput technology. In this study, we analyzed the low reproducibility of cancer biomarkers derived from high-throughput data from three aspects: 1. Low reproducibility is usually attributed to deficiencies in experimental design, differences between platforms, and the statistical analyses used to identify disease biomarkers. However, the inconsistency of biomarkers discovered from different sample sets for a particular cancer may instead reflect the biological variation and heterogeneity of the cancer, and the different biomarkers may be tracking a common set of biological phenotypes. Using two datasets for breast cancer metastasis, we first showed that although the biomarkers (hubs) identified in different studies have little overlap, they are highly consistent in that they share a significant number of interaction neighbors, and these shared interaction neighbors are significantly over-represented with known cancer genes and enriched in pathways deregulated in breast cancer pathogenesis. We then showed that the biomarkers (hubs) identified from the two datasets were highly reproducible at the protein interaction and pathway levels in three other independent datasets. Finally, our results suggest a possible biological model in which different signature hubs altered in different patient cohorts disturb the same pathways associated with cancer metastasis through their interaction neighbors. 2. Hundreds of genes with differential promoter DNA methylation have been identified as biomarkers for various cancers. 
However, the reproducibility of differential DNA methylation discoveries in cancer and the relationship between DNA methylation and aberrant gene expression have not been systematically analyzed. Using array data from seven types of cancer, we found that, for a given cancer, the directions of methylation and expression changes detected in different datasets were highly consistent once potential batch effects were excluded. Across different cancers, DNA hypermethylation was strongly associated with down-regulation of gene expression, whereas hypomethylation was only weakly associated with up-regulation of genes with large expression changes. Finally, we found that genes commonly hypomethylated in different cancers primarily performed functions associated with chronic inflammation, such as ‘keratinization’, ‘chemotaxis’ and ‘immune response’. 3. Because of the heterogeneity of cancer, thousands of samples may be needed to find even a few reproducible individual biomarkers. Considering that genes with similar functions (such as genes in the same pathway) tend to be expressed in a correlated fashion, we proposed a feature selection method that combines gene expression data with the Gene Ontology (GO) database to define functional modules. By analyzing seven cancer datasets, we showed that, in each cancer, a wide range of functional modules show altered gene expression and thus have high disease classification ability. The results also showed that seven modules are shared across diverse cancers, hinting at common mechanisms of cancer. Therefore, instead of relying on a few individual genes whose selection is hardly reproducible in current microarray experiments, functional modules may be used as functional signatures to study the core mechanisms of cancer and to build robust diagnostic classifiers. |
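
The module-level idea in the third aspect can be illustrated with a minimal Python sketch. It is not the method used in this study. The scoring rule (mean per-gene z-score across a module's genes), the ranking by a Welch t statistic, the function names, the gene and module identifiers, and the toy data are all assumptions made for illustration.

```python
from math import copysign, sqrt
from statistics import mean, stdev, variance


def module_scores(expr, modules):
    """Per-sample module score: mean z-score over the module's member genes.

    expr:    dict mapping gene -> list of expression values (one per sample)
    modules: dict mapping module name -> list of member genes
    (Hypothetical inputs; real modules would come from GO annotations.)
    """
    z = {}
    for gene, values in expr.items():
        mu, sd = mean(values), stdev(values)
        if sd > 0:  # skip genes with constant expression
            z[gene] = [(v - mu) / sd for v in values]
    if not z:
        return {}
    n_samples = len(next(iter(z.values())))
    scores = {}
    for name, genes in modules.items():
        present = [g for g in genes if g in z]
        if present:
            scores[name] = [mean(z[g][i] for g in present) for i in range(n_samples)]
    return scores


def welch_t(a, b):
    """Welch's t statistic for two groups of scores (no p-value computed)."""
    diff = mean(a) - mean(b)
    denom = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if denom == 0:
        return 0.0 if diff == 0 else copysign(float("inf"), diff)
    return diff / denom


def rank_modules(expr, modules, labels):
    """Rank modules by |t| between samples labelled 1 (case) and 0 (control)."""
    ranked = []
    for name, s in module_scores(expr, modules).items():
        case = [x for x, y in zip(s, labels) if y == 1]
        control = [x for x, y in zip(s, labels) if y == 0]
        ranked.append((name, welch_t(case, control)))
    return sorted(ranked, key=lambda pair: abs(pair[1]), reverse=True)


# Toy data: module_A genes shift between groups, module_B genes do not.
expr = {
    "G1": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],
    "G2": [2.0, 2.1, 1.9, 4.0, 4.2, 3.9],
    "G3": [5.0, 4.0, 6.0, 5.0, 4.0, 6.0],
    "G4": [1.0, 2.0, 3.0, 3.0, 2.0, 1.0],
}
modules = {"module_A": ["G1", "G2"], "module_B": ["G3", "G4"]}
labels = [0, 0, 0, 1, 1, 1]

ranked = rank_modules(expr, modules, labels)
for name, t in ranked:
    print(name, round(t, 3))
```

Averaging over member genes is one simple way to let a module carry signal even when no single gene is selected reproducibly. Other summaries, such as a median or a principal component, would fit the same outline.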