| Background: Esophageal cancer (EC) is a common tumor of the gastrointestinal tract. Its insidious onset and lack of obvious signs in the early stages mean that most patients are diagnosed at an advanced stage, resulting in a poor prognosis with a 5-year survival rate of only 15-25%. Currently, endoscopy combined with histopathological biopsy is the gold standard for the diagnosis of EC; however, this method is time-consuming and not easily used for widespread screening. Serum tumor marker testing is a simple, low-cost diagnostic method that is also less invasive and more acceptable to patients. However, the markers currently used for EC diagnosis, such as CEA, CA 19-9 and CA 125, have low sensitivity and specificity and cannot be used as a basis for EC diagnosis. Volatile organic compounds (VOCs) are important components of human metabolites, reflecting changes in the pathophysiology and metabolic state of the body, and have been detected in urine, serum, bile and exhaled breath. VOCs are produced by tumor cells and are ultimately excreted through the breath or in urine or faeces. It has been found that the VOCs produced by different tumor cells can reflect different diseases, so it is important to explore the differences in VOCs released by different cancer types in order to identify tumor-specific VOCs that can be used as diagnostic tools. Since 1971, when Linus Pauling et al. pioneered the use of gas chromatography-mass spectrometry (GC-MS) to describe the presence of hundreds of VOCs in human exhaled breath and urine, an increasing number of VOC-related studies have addressed the screening of markers for various tumors. Objective: By comparing the composition of urine and serum VOCs in patients with EC and healthy controls (HCs), we screen for potential EC-specific markers and construct a diagnostic model for EC, providing a research direction for the clinical application of VOC analysis in the diagnosis of EC. Methods: Gas chromatography-ion mobility 
spectrometry (GC-IMS) was used to analyse the composition of VOCs in the urine or serum of recruited patients diagnosed with EC and of HCs. Urine samples were obtained from 125 patients with EC and 107 HCs. Serum samples were obtained from 55 patients with EC and 84 HCs. The VOCs were characterised according to their ion mobility spectra, and diagnostic models were subsequently constructed using four popular machine learning algorithms: random forest (RF), neural network (NN), support vector machine (SVM) and decision tree (DT). To further screen the VOCs and optimise the predictive power of the diagnostic model for EC, the top-ranked VOCs were selected for the final model construction by analysing the Gini coefficients of the VOCs. In addition, the correlation between VOCs and EC was analysed in conjunction with clinical data. Results: VOCs were successfully analysed in the urine or serum of all subjects; 37 VOC signal peaks were identified in urine and 33 in serum. A urine-clinical characteristics database and a serum-clinical characteristics database were established based on patient information. The clinical characteristics included age, gender, height, weight and pathological stage. Based on all the VOC signal peaks found, the four machine learning algorithms (RF, NN, SVM and DT) were used to construct the models. Among them, the best diagnostic model was the one based on the RF algorithm, both in urine and in serum. Subsequently, the Gini coefficient of all VOCs was measured by the RF algorithm, and the top 8 VOC signal peaks were finally selected. The RF algorithm was used to analyse the peak heights of the top 8 ranked VOCs in urine or serum to construct a new diagnostic model. The urine RF diagnostic model consisted of eight VOCs: Cyclohexanone-D, 2,3-Butanediol, 2-Acetylfuran, Dimethyl trisulfide, 2-Methyl-butanoic acid methyl ester, Methyl decanoate, (E)-Ethyl-2-hexenoate, and 2-Isopropyl-3-methoxy pyrazine. The AUC of the model 
reached 0.874, with sensitivity and specificity of 84.2% and 90.6%, respectively. The serum RF diagnostic model consisted of five VOCs represented by eight signal peaks: 3-nonen-2-one, Butanol-1, Butanol-2, methyl 3-(methylthio)propanoate, (E)-3-hexen-1-ol-1, (E)-3-hexen-1-ol-2, (E)-3-hexen-1-ol-3, and 1-Hexanol. This model achieved an AUC of 0.951, a sensitivity of 94.1% and a specificity of 96.0%. Conclusions: (1) The VOC composition of the subjects' urine or serum was analysed by GC-IMS; 37 and 33 VOC signal peaks were identified in urine and serum, respectively, and significant differences in VOC levels between EC patients and HCs were found. (2) Based on RF analysis of the urine or serum data for EC patients and HCs, the EC screening model developed has good predictive power, and the screened VOCs can be used as potential diagnostic markers for EC. |
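
The peak-ranking step described in the Methods can be illustrated with a minimal sketch. It is a simplified stand-in, not the study's pipeline: a random forest averages impurity decreases over many trees, whereas this example scores each peak by the largest Gini impurity decrease that a single threshold achieves. The function names, the peak names (`peak_A`, `peak_B`, `peak_C`) and all peak heights are hypothetical and made up for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def gini_impurity(labels: Sequence[int]) -> float:
    """Gini impurity of a set of binary labels (0 = HC, 1 = EC)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split_gain(values: Sequence[float], labels: Sequence[int]) -> float:
    """Largest Gini impurity decrease obtainable with one threshold on a peak."""
    n = len(labels)
    parent = gini_impurity(labels)
    pairs = sorted(zip(values, labels))
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal peak heights
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        child = (len(left) * gini_impurity(left)
                 + len(right) * gini_impurity(right)) / n
        best = max(best, parent - child)
    return best


def rank_peaks(peaks: Dict[str, Sequence[float]],
               labels: Sequence[int],
               top_k: int) -> List[Tuple[str, float]]:
    """Return the top_k peaks ordered by their best single-split Gini decrease."""
    scored = [(name, best_split_gain(vals, labels)) for name, vals in peaks.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Made-up peak heights for six hypothetical subjects (three HCs, three EC).
labels = [0, 0, 0, 1, 1, 1]
peaks = {
    "peak_A": [0.10, 0.20, 0.15, 0.90, 0.80, 0.85],  # separates the groups cleanly
    "peak_B": [0.50, 0.40, 0.60, 0.55, 0.45, 0.50],  # close to uninformative
    "peak_C": [0.10, 0.30, 0.70, 0.20, 0.80, 0.90],  # partially informative
}
top = rank_peaks(peaks, labels, top_k=2)
for name, gain in top:
    print(f"{name}: Gini decrease = {gain:.3f}")
```

In this toy data the two retained peaks are `peak_A` and `peak_C`. In the study the analogous selection kept the top 8 peaks, on which the final RF model was then built.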