| Liver cancer is one of the leading causes of cancer-related deaths worldwide. Patients usually have no obvious symptoms in the early stages, and few tumor markers are available, which makes early diagnosis very difficult. In addition, intermediate and advanced liver cancer is extremely risky to treat and has a poor prognosis. There is therefore a need to identify new biomarkers relevant to early diagnosis and to develop an effective diagnostic system. To investigate the characteristic differential genes closely related to the development of liver cancer and to develop a machine learning-based liver cancer diagnosis model, an empirical analysis was performed on several liver cancer gene datasets using a hybrid random forest + neural network approach. A batch-corrected combined dataset of the TCGA and GTEx databases was downloaded from the UCSC official website, and the tumor and normal liver tissue samples were selected from it as one dataset; a liver cancer gene expression dataset was downloaded from the GEO database and its batch effect was removed with the ComBat method to form another dataset; clinical information for liver cancer patients was downloaded from the TCGA database for survival analysis. Differential genes were first screened in each of the two datasets under the set statistical conditions with the limma package in R, which uses linear modeling and empirical Bayes methods. The intersection of the two differential gene sets served as the basis for further investigation. The characteristic differential genes associated with hepatocarcinogenesis were then screened using random forest, and enrichment analysis and survival analysis were performed to investigate the biological significance of the characteristic genes and their relationship with the survival time of hepatocarcinoma patients. The TCGA + GTEx dataset was used as
the training set and the GEO dataset as the test set; the expression of the characteristic genes was binarized and input as variables into the BP neural network model, which was then used to diagnose hepatocellular carcinoma. To further validate the experimental results, a new liver cancer gene expression dataset was downloaded from the GEO database; after its batch effect was removed, the feature genes were extracted and input into the neural network model for validation, and the accuracy of the model was finally assessed from its confusion matrix and AUC values. With thresholds of P-value < 0.05, FDR < 0.01, and |log FC| > 2, a total of 2154 differential genes associated with liver cancer were screened from the TCGA + GTEx dataset and 274 from the GEO dataset; taking the intersection of the two gene sets yielded 138 common differential genes. The ten genes with the highest importance scores were selected from the 138 differential genes by the Gini score of the random forest algorithm; of these, CLEC4M and CLEC1B were down-regulated in the disease group, and GMNN, NDC80, CAP2, COL15A1, KIF20A, CCNB1, RACGAP1, and RRM2 were up-regulated in the disease group. Enrichment analysis revealed that the signature genes were mainly enriched in four areas (PID Aurora B pathway, mitosis in cell cycle, coagulation, and cell morphogenesis) and in two pathways (p53 signaling and C-type lectin receptor signaling), and were significantly associated with a total of 253 molecular function, cellular component, and biological process terms. Single-gene survival analysis of the 10 signature genes revealed that the expression of seven genes, CCNB1, CLEC1B, COL15A1, KIF20A, NDC80, RACGAP1, and RRM2, was significantly associated with survival time in hepatocellular carcinoma patients. The 10 characteristic genes were binarized and input as variables into the BP
neural network model for diagnosing liver cancer, and the model performed well, with accuracies of 0.983, 1.000, and 0.868 on the training set, test set, and validation set, respectively. These results show that the 10 characteristic differential genes selected by the random forest algorithm are related to the occurrence and development of liver cancer, and that a diagnosis model combining these genes as input variables with a BP neural network can accurately identify cancer samples, with an accuracy of up to 1.000. The advantages shown by these characteristic genes in the subsequent analyses indicate their reference value for liver cancer research; implementing the diagnosis model can help medical professionals develop a diagnostic system for liver cancer, and the method can also provide ideas for gene mining for the early diagnosis of other diseases. |
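
The screening and binarization steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code (the study used limma in R), and several parts are assumptions made only for illustration: the tuple layout of the input rows, the helper names `screen_degs`, `binarize`, and `accuracy`, the toy numbers, the placeholder genes `GENE_X` and `GENE_Y`, and in particular the median-based binarization rule, which the abstract does not specify. Only the three thresholds come from the text.

```python
from statistics import median

# Thresholds quoted in the abstract.
P_MAX, FDR_MAX, LOGFC_MIN = 0.05, 0.01, 2.0


def screen_degs(rows):
    """Return {gene: logFC} for rows that pass all three thresholds.

    `rows` is a hypothetical list of (gene, logFC, p_value, fdr) tuples,
    for example parsed from an exported differential-expression table.
    """
    return {
        gene: log_fc
        for gene, log_fc, p_value, fdr in rows
        if p_value < P_MAX and fdr < FDR_MAX and abs(log_fc) > LOGFC_MIN
    }


def binarize(expression, direction):
    """Binarize one gene's expression across samples.

    ASSUMED rule (not stated in the abstract): a sample scores 1 when its
    value lies on the disease-like side of the gene's median, i.e. above
    the median for up-regulated genes and below it for down-regulated ones.
    """
    cutoff = median(expression)
    if direction == "up":
        return [1 if value > cutoff else 0 for value in expression]
    return [1 if value < cutoff else 0 for value in expression]


def accuracy(tp, fp, fn, tn):
    """Accuracy computed from the four cells of a 2x2 confusion matrix."""
    return (tp + tn) / (tp + fp + fn + tn)


# Toy rows with invented statistics; GENE_X and GENE_Y are placeholders.
tcga_gtex_rows = [
    ("RRM2", 3.1, 1e-6, 1e-4),
    ("CLEC1B", -4.2, 1e-8, 1e-6),
    ("GENE_X", 0.4, 0.2, 0.5),
]
geo_rows = [
    ("RRM2", 2.6, 1e-5, 1e-3),
    ("CLEC1B", -3.0, 1e-7, 1e-5),
    ("GENE_Y", 2.5, 1e-4, 0.004),
]

# Genes that pass the thresholds in both datasets.
common_degs = sorted(set(screen_degs(tcga_gtex_rows)) & set(screen_degs(geo_rows)))

# Binary features for one up-regulated and one down-regulated gene.
up_bits = binarize([1.0, 5.0, 2.0, 8.0], "up")
down_bits = binarize([1.0, 5.0, 2.0, 8.0], "down")
```

In this sketch the binary vectors would then serve as the input variables of a classifier; the random forest ranking and the BP neural network themselves are left out because they need a third-party library.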