| According to the latest cancer statistics report, lung cancer has the highest mortality of any cancer in China, accounting for one-fifth of cancer deaths. As the country's economy and society develop, China's cancer incidence pattern is shifting from that typical of developing countries toward that of developed countries, and the high incidence of lung cancer is particularly prominent in males. Most confirmed cases are lung adenocarcinoma, lung squamous cell carcinoma, or small cell lung cancer. Each lung cancer type has its own specific treatment, so the patient's type must be identified accurately before treatment so that the appropriate therapy can be chosen. At present, the methods used clinically to determine lung cancer type are mostly invasive, such as puncture biopsy and surgical extraction of tissue. These invasive methods carry a risk of complications and can adversely affect the treatment of lung cancer patients. With the rapid growth of data, the large volume of available medical data makes digital diagnosis possible. It is therefore of great significance to establish a complete non-invasive lung cancer subtype prediction model as an auxiliary method for diagnosis and treatment. Based on the records of lung cancer patients admitted to a domestic tertiary hospital, this paper proposes a non-invasive scheme for diagnosing lung cancer type that uses machine learning methods to predict lung cancer subtypes. The main contents are as follows: (1) Appropriate data preprocessing methods are selected according to the characteristics of medical data. Medical data is cluttered and irregularly recorded, with many missing values and even missing sample labels. These problems make it very difficult to construct classification models. This paper uses K-nearest-neighbor imputation to fill in the missing values. In addition, medical data is class-imbalanced because the subtypes differ in their baseline incidence. In this paper, the SMOTE oversampling method is used to balance the data. (2) The optimal feature subset is selected using different methods. The data in this paper contains more than 60 indicators, such as patient diagnostic information, laboratory indicators, and chronic disease history. Different machine learning models require different features. In this paper, three major categories and five subcategories of feature selection methods are used: filter methods (Correlation Coefficient Method, Mutual Information Method, Relief-F, etc.), wrapper methods (Forward Selection, Backward Selection, Global Search), and embedded methods (LASSO, Ridge Regression). (3) A lung cancer classification prediction model based on machine learning is proposed. Three machine learning methods are selected (support vector machines, random forests, and probabilistic neural networks) and combined with the feature selection methods to build predictive models. Precision, recall, and AUC are used as evaluation metrics, and the random forest combined with Relief-F feature selection achieves the better predictive performance. |
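
The two preprocessing steps named in item (1) can be illustrated with a minimal sketch. The abstract does not give implementation details, so everything below is an assumption made for illustration: the function names `knn_impute` and `smote` are hypothetical, the code uses only the Python standard library, and the distance is a mean squared difference over jointly observed columns. In practice, library implementations such as scikit-learn's `KNNImputer` and imbalanced-learn's `SMOTE` are commonly used instead.

```python
import math
import random


def _distance(a, b, cols):
    """Root mean squared difference between rows a and b over column indices cols."""
    return math.sqrt(sum((a[c] - b[c]) ** 2 for c in cols) / len(cols))


def knn_impute(rows, k=3):
    """Replace each None with the mean of that column over the k nearest rows.

    Distances use only the columns observed in both rows. A value stays None
    if no other row has both a shared observed column and the missing column.
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if shared:
                    candidates.append((_distance(row, other, shared), other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Assumes complete (already imputed) rows and at least two samples.
    """
    rng = random.Random(seed)
    cols = range(len(minority[0]))
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: _distance(base, p, cols),
        )[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (p - b) for b, p in zip(base, partner)])
    return synthetic
```

Imputation runs before oversampling here because the interpolation step needs complete feature vectors. In a real evaluation, oversampling would normally be applied to the training split only, so that synthetic samples do not leak into the test data.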