| Background and Objectives

Non-small-cell lung cancer (NSCLC) patients with the same TNM stage can differ widely in prognosis. Even patients with early-stage NSCLC show lower-than-expected survival rates after surgical resection, which indicates that current TNM staging does not adequately predict outcome. This has motivated studies of tumor biological characteristics that aim to identify prognostic gene signatures.

Although this is an active research topic and many such studies have been reported, the IASLC International Staging Committee did not include their results in the latest (7th) edition of the TNM staging system, mainly because the results were not yet stable enough. For example, a prognostic model established on a training cohort could barely pass validation in an independent cohort, or at least showed reduced sensitivity and specificity.

Our research focused on the biological characteristics related to early-stage NSCLC prognosis. On the one hand, we established a prognostic model by a routine machine learning process. On the other hand, we tried to focus on the patterns revealed by the data rather than on the vast amount of data itself, and studied the biological functions related to early-stage NSCLC prognosis. Functional clusters were chosen as the entry point for array data analysis to establish a prognostic model for early-stage NSCLC.

Methods

Part 1 Early-stage NSCLC prognostic model established by machine learning process

120 NSCLC samples and 53 matched normal tissues were obtained with informed consent between April 2003 and June 2006. Follow-up was carried out routinely and recorded in detail. Tumor samples were purified on the basis of histological assessment to ensure that tumor cells made up more than 80% of each sample.
The Affymetrix U133 Plus 2.0 array was used for high-throughput gene expression analysis.

Our research focused on establishing a prognostic model for early-stage NSCLC. 86 samples (stage I-II) were enrolled in total. Samples with OS of less than 2.5 years were assigned to the high-risk group, and samples with DFS of more than 5 years were assigned to the low-risk group. The 50 samples that fell into a definite prognostic subgroup were analyzed to establish the model.

127 samples obtained in the same period (105 of them in common with the 120 samples above) were processed by Agilent Oligonucleotide Array-Based CGH for Genomic DNA Analysis to obtain copy number aberration (CNA) results.

The prognostic model was established by a routine machine learning process. The first step was candidate gene screening according to the three requirements listed below. The second step was establishing the RS formula by forward selection.

(1) Differentially expressed genes between matched tumor and normal tissues
(2) Genes with CNA in more than 10% of tumor samples
(3) Genes with P<0.05 in univariate Cox regression of all samples

The prognostic efficacy of the machine learning prognostic model was assessed in the training and validation cohorts respectively.

Part 2 Early-stage NSCLC prognostic model established by gene functional cluster

Expression array and CGH array experiments were performed as in Part 1. For the CGH array data, we focused on CNAs present in more than 3% of samples. Biological functions correlated with NSCLC were assessed from genes with focal amplification or focal deletion. For the expression array analysis, we focused on biological functional clusters in order to identify the prognosis-correlated functions and their representative genes. We established the model manually in order to show the full range of functions related to early-stage NSCLC prognosis.
The functional prognostic model was also assessed in an independent validation cohort, in order to provide insights for further refinement.

Results

Part 1 Early-stage NSCLC prognostic model established by machine learning process

1. Establishment of machine learning prognostic model

22 candidate genes were identified as satisfying the following three conditions:
(1) Differentially expressed genes between matched tumor and normal tissues: 2383 genes
(2) Genes with CNA in more than 10% of tumor samples: 953 genes
(3) Genes with P<0.05 in univariate Cox regression: 1381 genes

RS formula acquired by forward selection analysis:
RS = (CLDN11 × 0.777) + (SATB1 × 1.379) + (ANLN × 1.334) + (NUF2 × -0.651)

2. Validation of machine learning prognostic model in training cohort

(1) All 50 samples: log-rank test P<0.001; specificity: 24/28=85.7%; sensitivity: 18/22=81.8%; accuracy: (24+18)/50=84.0%.
(2) Samples in the same TNM stage (38 samples in stage I): P<0.001; specificity: 22/25=88.0%; sensitivity: 11/13=84.6%; accuracy: (22+11)/38=86.8%.
(3) 18 AC samples in stage I: P<0.001; specificity: 13/14=92.9%; sensitivity: 3/4=75.0%; accuracy: (13+3)/18=88.9%.

3. Validation of machine learning prognostic model in Lee et al. dataset

(1) All 70 samples: P=0.013; specificity: 23/35=65.7%; sensitivity: 23/35=65.7%; accuracy: (23+23)/70=65.7%.
(2) 31 AC samples: P=0.072; specificity: 11/18=61.1%; sensitivity: 10/13=76.9%; accuracy: (11+10)/31=67.7%.
(3) 39 SCC samples: P=0.063; specificity: 12/21=57.1%; sensitivity: 13/18=72.2%; accuracy: (12+13)/39=64.1%.

4. Validation of the four-gene Cox regression model (using the genes chosen by the machine learning prognostic model) in the Lee et al., Hou et al. and Bild et al. datasets

(1) Lee et al. dataset (all 70 samples):
RS = (CLDN11 × 0.079) + (SATB1 × 0.065) + (ANLN × 0.681) + (NUF2 × -0.353)
P=0.004; specificity: 23/35=65.7%; sensitivity: 23/35=65.7%; accuracy: (23+23)/70=65.7%.
(2) Bild et al. dataset (all 52 samples):
RS = (CLDN11 × -0.019) + (SATB1 × 0.110) + (ANLN × 0.275) + (NUF2 × -0.074)
The model failed to predict sample prognosis, P=0.892.
(3) Hou et al. dataset (all 48 samples):
RS = (CLDN11 × -0.029) + (SATB1 × -0.014) + (ANLN × -0.070) + (NUF2 × 0.264)
The model failed to predict sample prognosis, P=0.713.

5. Limitations of the candidate gene screening process in machine learning prognostic model establishment

Two separate experiment groups of 48 samples each were obtained by randomly removing 2 samples from the original cohort (50 samples). Univariate Cox regression showed the following:
(1) Genes with P<0.05 in experiment group 1: 1358 genes.
(2) Genes with P<0.05 in experiment group 2: 1359 genes.
(3) Genes with P<0.05 in both experiment group 1 and group 2: 1130 genes.
(4) Genes with P<0.05 in both experiment group 1 and the original cohort (50 samples): 1186 genes.
(5) Genes with P<0.05 in both experiment group 2 and the original cohort (50 samples): 1240 genes.
(6) Genes with P<0.05 in all 3 groups: 1113 genes. Thus 268 of the 1381 genes selected in the original cohort (1381 - 1113) were not significant in all three groups.

Part 2 Early-stage NSCLC prognostic model established by gene functional cluster

1. NSCLC samples showed apparent DNA copy number aberrations

The NSCLC-related biological functions identified by CGH data analysis were: cell proliferation/differentiation, cell cycle, cell apoptosis, cell adhesion, immune response, etc.

2. Early-stage NSCLC prognosis-correlated biological functions and representative genes based on functional cluster

Cell cycle correlated genes: ANLN, BUB1B and CCDC99;
Cell proliferation correlated genes: DUSP4, STIL and MKI67;
Cell adhesion correlated genes: HMMR and CD9;
Cell apoptosis correlated genes: KIAA0101 and BIRC5;
Immune response correlated genes: CD1A and C5;
Blood coagulation correlated genes: F12 and PGDS;
Metabolism correlated genes: LPGAT1 and PPARGC1A.

Among them, the cell cycle and cell proliferation correlated genes were the most important, yet they were still not strong enough to represent the genes with other biological functions.

3. Validation of functional prognostic model

(1) Training cohort:
RS = (MKI67 × -1.227) + (ANLN × 1.296) + (BUB1B × 0.700) + (CCDC99 × 2.048) + (DUSP4 × -0.853) + (STIL × -2.255) + (HMMR × -0.483) + (CD9 × -2.083) + (KIAA0101 × 2.907) + (BIRC5 × -1.371) + (CD1A × 0.108) + (C5 × -1.333) + (LPGAT1 × 1.853) + (PPARGC1A × 1.765) + (F12 × -0.393) + (PGDS × 0.246)
Log-rank test P<0.001; specificity: 25/28=89.3%; sensitivity: 19/22=86.4%; accuracy: (25+19)/50=88.0%.
(2) Lee et al. dataset:
RS = (MKI67 × -0.024) + (ANLN × 0.414) + (BUB1B × 0.986) + (CCDC99 × 0.765) + (DUSP4 × -0.001) + (STIL × 0.762) + (HMMR × -0.429) + (CD9 × -0.261) + (KIAA0101 × -0.401) + (BIRC5 × -0.490) + (CD1A × 0.291) + (C5 × -0.316) + (LPGAT1 × -0.142) + (PPARGC1A × 0.796) + (F12 × -0.009) + (PGDS × 0.648)
Log-rank test P<0.001; specificity: 26/35=74.3%; sensitivity: 26/35=74.3%; accuracy: (26+26)/70=74.3%.

4. Analysis of the functional model, whose specificity and sensitivity need to be improved

When the samples were ordered by descending RS value, the misclassified samples all clustered in a middle gray area. The width and spread of the gray area were highly correlated with the model's ability to assess prognosis. This provided insights for further refinement of the prognostic model.

Primary conclusions

1. The machine learning prognostic model can predict the prognosis of training cohort samples (with sensitivity and specificity above 80%, independent of the TNM staging system). However, its efficacy in an independent validation cohort is not high enough, even in an Asian population.

2. Two of the four genes in the machine learning model are cell cycle correlated genes, indicating that cell cycle and cell proliferation genes have the highest correlation with early-stage NSCLC prognosis. Yet cell cycle and cell proliferation genes alone are not enough to reach a definite conclusion. Five other kinds of biological function are also correlated with NSCLC prognosis: cell adhesion, cell apoptosis, immune response, metabolism and blood coagulation.

3. Sixteen representative genes of 7 biological functions were used to establish the functional prognostic model, which showed much better prediction efficacy in the independent validation cohort.

4. Analysis of the functional model, whose specificity and sensitivity need to be improved, showed that the misclassified samples all clustered in the middle gray area. The width and spread of the gray area were highly correlated with the model's ability to assess prognosis.

5. Prognosis-related genes can be divided into two sets: genes with positive effects (Gene_prognosis-positive) and genes with negative effects (Gene_prognosis-negative). The final result depends on the balance between the two gene sets. If neither set clearly outweighs the other, the sample should be assigned to the gray area. This provided insights for further refinement of the prognostic model:
(1) Try to narrow down the gray area
(2) Try to identify the cut-off of the gray area between the high-risk group and/or the low-risk group... |
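
The arithmetic behind the reported figures can be illustrated with a minimal Python sketch. It uses the four-gene RS coefficients reported for the training cohort in Part 1 of the Results and reproduces the 50-sample sensitivity, specificity and accuracy from their counts. Several things are assumptions, not statements from the source: that the high-risk group is the positive class (22 high-risk and 28 low-risk samples), that a sample is called high-risk when its RS exceeds a cut-off, and the function names `risk_score`, `rates` and `gray_zone`, which are hypothetical. The gray area is approximated here as the RS range spanned by the misclassified samples.

```python
# Coefficients of the four-gene RS formula reported for the training cohort.
COEFFICIENTS = {"CLDN11": 0.777, "SATB1": 1.379, "ANLN": 1.334, "NUF2": -0.651}


def risk_score(expression, coefficients=COEFFICIENTS):
    """Return the weighted sum of expression values over the formula's genes."""
    return sum(weight * expression[gene] for gene, weight in coefficients.items())


def rates(true_positive, positives, true_negative, negatives):
    """Return (sensitivity, specificity, accuracy) in percent, one decimal each.

    ASSUMPTION: "positive" means high-risk; the source does not say which
    group it treats as the positive class.
    """
    sensitivity = 100 * true_positive / positives
    specificity = 100 * true_negative / negatives
    accuracy = 100 * (true_positive + true_negative) / (positives + negatives)
    return tuple(round(value, 1) for value in (sensitivity, specificity, accuracy))


def gray_zone(scores, is_high_risk, cutoff):
    """Return the (lowest, highest) RS among misclassified samples, or None.

    ASSUMPTION: a sample is predicted high-risk when its RS exceeds `cutoff`;
    the source does not state the decision rule or the cut-off value.
    """
    wrong = [
        score
        for score, truth in zip(scores, is_high_risk)
        if (score > cutoff) != truth
    ]
    return (min(wrong), max(wrong)) if wrong else None


if __name__ == "__main__":
    # Hypothetical expression values, for illustration only.
    sample = {"CLDN11": 1.0, "SATB1": 2.0, "ANLN": 0.5, "NUF2": 1.5}
    print("RS:", risk_score(sample))

    # Counts reported for all 50 training samples: 18/22 and 24/28.
    print("sensitivity, specificity, accuracy:", rates(18, 22, 24, 28))

    # Hypothetical scores and outcomes showing errors near the cut-off.
    scores = [3.0, 2.0, 1.2, 0.9, 0.1]
    outcomes = [True, True, False, True, False]
    print("gray zone:", gray_zone(scores, outcomes, cutoff=1.0))
```

A narrower range returned by `gray_zone` would correspond to the narrower gray area that the conclusions propose as a refinement target. How the gray-area boundaries should actually be chosen is left open by the source.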