| Lung cancer is a malignant tumor with high morbidity and mortality worldwide. The situation is more severe in China, where its morbidity and mortality rank first. Lung adenocarcinoma (LUAD) is a type of lung cancer whose morbidity is rising sharply year by year. LUAD is seldom cured because of metastasis and delayed hospital admission. Advances in gene detection and diagnosis offer new opportunities for the prevention and early-stage treatment of LUAD. However, the number of genes is very large, and sequencing the whole genome of every patient would raise costs and lower efficiency. We therefore aim to reduce the gene-detection workload through gene classification based on the similarity between genes. This study is based on gene data of LUAD patients from the National Center for Biotechnology Information (NCBI). Statistical and machine learning methods are used to construct several classifiers for the key genes. In the course of the study, a Factor Analysis-SMOTE-KNN/Logistic Regression/AdaBoost classification model is proposed, in which Factor Analysis is used for classification and annotation, SMOTE is used to balance the data, and the classification algorithm is used to build multi-class classifiers. The model achieves better classification performance and provides a basis for clinical gene screening that saves money and time. The main work is as follows: (1) Preprocessing of the LUAD gene data. First, three gene data sets S1, S2 and S3 are selected, whose fold change values satisfy |logFC|>3.5, |logFC|>2.5 and |logFC|>1.5, respectively, with P<0.001. Second, because the number of genes is much larger than the number of samples, which is unfavorable for factor analysis, we increase the sample size with SMOTE so that factor analysis can be carried out. Finally, we apply min-max normalization to the data to eliminate the influence of
dimension and magnitude. (2) Labeling the categories of the three gene data sets. Factor analysis is applied with the genes treated as variables. We classify and label the genes according to their loadings on the common factors, obtaining three annotated data sets. (3) Comparing the performance of the classification methods. Ten-fold cross-validation is used to divide each of the three data sets into training and test sets. Accuracy, macro precision, macro recall and macro F1 serve as evaluation criteria for comparing KNN, Logistic Regression and AdaBoost. We also examine how classification performance is affected by the size of the data set, the K value in KNN, the regularization strength λ in Logistic Regression and the number of weak classifiers in AdaBoost. (4) Eliminating the impact of class imbalance with SMOTE. We find that macro precision, macro recall and macro F1 are quite low even when accuracy is acceptable. The reason is that labeling genes by factor analysis produces imbalanced classes. We therefore use SMOTE to reduce the impact of class imbalance and construct SMOTE-KNN, SMOTE-Logistic Regression and SMOTE-AdaBoost. We again adopt ten-fold cross-validation and the four evaluation indices, and examine the effects of data set size, K value, λ value and number of weak classifiers. The results show that all four evaluation indices improve and that the gap between accuracy and the other three narrows markedly. (5) Classifying LUAD samples based on key genes. We choose the genes with the highest common factor loading in each class as key genes. KNN, Logistic Regression and AdaBoost are then used to classify tumor and normal samples based on these selected genes. The accuracy of all three algorithms exceeds 85%, and that of KNN and Logistic Regression exceeds 90%. The results show that the key
genes obtained in this paper are effective for sample recognition. Moreover, MMP1, ENTPD8, RTKN2 and STRA6 are selected as key genes repeatedly, which indicates that they are highly important and deserve more attention in clinical gene screening. In conclusion, before SMOTE is applied, accuracy and macro F1 are best when K=1 in KNN, the regularization strength λ=0 in Logistic Regression and the number of weak classifiers is in {20,40,60} in AdaBoost. After SMOTE is adopted, accuracy and macro F1 improve greatly when K ∈ {4,5,6}, λ ∈ {0,1} and the number of weak classifiers is in {20,50,60}. As the size of the data set increases, the Matlab running time of all six algorithms increases. For the same data set, the algorithms with SMOTE take more time than those without. KNN has the shortest running time and AdaBoost the longest, whether or not SMOTE is used. KNN is suitable for smaller data sets, while AdaBoost has some advantages in accuracy and macro F1 for larger data sets. |
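
The observation in item (4), that accuracy can look acceptable while the macro-averaged indices stay low under class imbalance, can be illustrated with a small sketch. The code below is not the authors' implementation (the study reports Matlab running times). It is a minimal pure-Python illustration in which the function names `macro_scores` and `smote_like`, the toy labels and the toy minority points are all assumptions made for the example. `smote_like` shows only the core interpolation idea behind SMOTE (a synthetic point on the segment between a minority sample and one of its nearest minority neighbours), not a full implementation.

```python
import random


def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Assumed convention: a score with a zero denominator counts as 0.
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (SMOTE's core step).
    Requires at least two minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Toy imbalanced labels: 18 samples of class 0, 2 samples of class 1,
# and a degenerate classifier that always predicts the majority class.
y_true = [0] * 18 + [1] * 2
y_pred = [0] * 20
acc, macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
print(acc, macro_p, macro_r, macro_f1)

# Oversample a toy minority class of 3 points up to the majority size of 18.
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]]
new_points = smote_like(minority, n_new=15)
print(len(minority) + len(new_points))
```

In this toy case the majority-class predictor reaches an accuracy of 0.90 while macro recall is only 0.50, because the minority class is never recognized. This is the kind of gap that balancing the classes before training is meant to narrow.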