Research On Colorectal Cancer Prediction Model Based On Feature Selection

Posted on: 2020-05-04
Degree: Master
Type: Thesis
Country: China
Candidate: D D Zhao
GTID: 2434330575459484
Subject: Computer software and theory

Abstract/Summary:
Colorectal cancer (CRC) is one of the most common cancers of the digestive system. According to statistics, more than 1.2 million people worldwide suffer from CRC every year, and deaths account for about half of the patients, up to 600,000 people. CRC not only poses a serious threat to human health but also causes huge losses to the national economy. Current diagnostic methods, such as X-ray, serum CEA, B-mode ultrasonography and endoscopy, undoubtedly play an important role in diagnosing colorectal cancer, but they depend on the doctor's experience, which makes accuracy difficult to guarantee and increases the workload of medical staff.

In view of these limitations, prediction models that incorporate machine learning algorithms have gradually become a research hotspot. The strength of machine learning in disease prediction is that it learns from medical data and, more importantly, makes final decisions based on the constructed models. This is of great significance for improving the accuracy and timeliness of disease diagnosis and for reducing the workload of medical staff. However, a single machine learning algorithm may not generalize well when classifying unseen data, so the integration and optimization of multiple techniques must be considered. At present, the prediction of colorectal cancer with machine learning algorithms faces many problems, such as redundant features, improper feature selection, improper classifier selection and imbalanced data samples, which lead to poor performance of some algorithms. To improve the prediction accuracy of the model and make it more applicable in practice, both the selection of important colorectal cancer features and the performance of the classifier need to be considered.

Supported by the National Natural Science Foundation of China (Project No. 61876102, 61272094 and 61472232), we have carried out further research on prediction models for colorectal cancer. To predict colorectal cancer more accurately with machine learning and to overcome the problems of feature selection, classifier choice and sample imbalance, we establish SVM and MKFSVM prediction models based on the low-dimensional data of intestinal flora and the high-dimensional data of gene microarrays, respectively. These prediction models provide a basis for feature and model selection in data of different dimensions, and also offer a new line of thought for the prediction of colorectal cancer. The main work and innovations of this paper are as follows:

1. To address improper feature selection in traditional low-dimensional data for colorectal cancer, a feature selection method based on the logistic regression model and the ROC curve is proposed. First, from a statistical point of view, logistic regression is used to select the significant factors (p < 0.05) from the factors related to colorectal cancer. Then, the ROC curve is used to select, according to the AUC value, the combination of factors that most affects the disease as the input of the SVM. In addition, the optimal kernel function is selected by comparing the effects of four different SVM kernel functions on the classification results. In this way, the accuracy of the prediction model is improved.

2. To address high feature dimensionality, redundant genes and unrelated genes in colorectal cancer microarray data, a feature selection method combining differentially expressed genes and the mRMR algorithm is proposed. Gene microarrays of colorectal cancer samples contain thousands of genes, but many of them are redundant or unrelated to the disease. Once these genes are fed into the classifier, they degrade its predictive performance. Therefore, we use differentially expressed genes to determine the most relevant factors, and then use the mRMR algorithm to select the optimal feature combination from the differentially expressed genes, so as to improve the prediction results for colorectal cancer.

3. To address the limitation of a single SVM kernel function, a mixed kernel function based on the RBF kernel function and the polynomial kernel function is proposed. The RBF kernel function has strong local optimization ability, and the polynomial kernel function has strong global search ability. However, most existing SVM prediction models based on colorectal cancer gene microarrays use either the RBF kernel function or the polynomial kernel function alone. As a result, the support vector machine has either local ability or global ability, but not both. Therefore, we propose a mixed kernel function that combines the RBF kernel function and the polynomial kernel function, so that the mixed kernel has local and global ability at the same time. In addition, the whale optimization algorithm, a recent optimization algorithm, is used to optimize the parameters, which helps to improve the classification performance.

4. To address class imbalance in colorectal cancer gene samples, a colorectal cancer prediction model based on the RUSBoost algorithm is built. Samples selected in many studies tend to be balanced (class ratio close to 1:1), but in many cases the number of samples differs greatly between classes, which strongly affects the classification results. To balance the number of samples across classes, this paper uses the RUSBoost algorithm to rebalance the training set in each boosting round (RUSBoost does so by randomly undersampling the majority class), which indirectly changes the weight updates and compensates for the skewed distribution, so as to achieve balance between classes and improve the overall performance of the prediction model.

In summary, starting from the low-dimensional data of intestinal flora and the high-dimensional data of gene microarrays, we propose an SVM prediction model that uses a two-stage feature selection combining logistic regression and the ROC curve, and an MKFSVM prediction model that uses a feature selection method combining differentially expressed genes and the mRMR algorithm. For the gene microarray data, the effect of class imbalance on the prediction results is also taken into account by applying the RUSBoost algorithm to the prediction model. Comparison with other methods shows that the proposed approach improves the prediction performance for colorectal cancer. It provides a basis for feature selection and model selection in data of different dimensions, and also offers a new way of thinking about the prediction of colorectal cancer. In addition, because the purpose of our prediction model is to improve the accuracy of disease prediction, it can be applied to the diagnosis of other diseases based on text data, assisting medical staff and reducing their workload.
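
The AUC-based second stage of the feature selection in contribution 1 can be illustrated with a minimal sketch. It assumes the logistic regression step has already produced the indices of the significant factors (`candidates`), and it scores a combination by the unweighted sum of its feature values; both the scoring rule and the names `auc` and `best_combination` are illustrative assumptions, not the thesis's implementation.

```python
from itertools import combinations


def auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation.

    Returns the fraction of (positive, negative) pairs in which the positive
    sample has the higher score; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def best_combination(X, y, candidates, max_size=2):
    """Return (AUC, feature indices) of the best-scoring combination.

    ASSUMPTION: a combination is scored by summing its feature values.
    The strict '>' keeps the smaller combination when AUCs are equal.
    """
    best = (0.0, ())
    for k in range(1, max_size + 1):
        for combo in combinations(candidates, k):
            scores = [sum(row[j] for j in combo) for row in X]
            value = auc(scores, y)
            if value > best[0]:
                best = (value, combo)
    return best


# Toy data: 4 samples, 3 candidate factors (hypothetical values).
X_toy = [[1, 5, 0], [2, 1, 1], [3, 4, 0], [4, 2, 1]]
y_toy = [0, 0, 1, 1]
```

In a real pipeline the combination score would more plausibly come from a fitted model's predicted probability; the exhaustive search shown here is only practical for a small number of candidate factors.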
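
The mRMR step of contribution 2 can be sketched as a greedy search that, at each round, picks the feature with the highest relevance to the class label minus its mean redundancy with the features already chosen. The sketch assumes expression values have been discretized and that the differential-expression filter has already been applied; the function names and the toy genes are hypothetical, and the difference criterion used here is one common mRMR variant, not necessarily the one used in the thesis.

```python
from collections import Counter
from math import log2


def mutual_information(a, b):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * log2((c / n) / ((pa[x] / n) * (pb[y] / n)))
        for (x, y), c in pab.items()
    )


def mrmr(features, target, k):
    """Greedy mRMR selection (difference criterion).

    features: dict mapping feature name -> list of discrete values.
    Returns the names of the k selected features, in selection order.
    """
    relevance = {
        name: mutual_information(col, target) for name, col in features.items()
    }
    remaining = list(features)
    selected = []

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): the label is gene_a AND gene_b;
# gene_a_dup is an exact copy of gene_a and therefore redundant.
gene_a = [0, 0, 1, 1, 0, 0, 1, 1]
gene_b = [0, 1, 0, 1, 0, 1, 0, 1]
label = [a & b for a, b in zip(gene_a, gene_b)]
toy_features = {"gene_a": gene_a, "gene_a_dup": list(gene_a), "gene_b": gene_b}
chosen = mrmr(toy_features, label, k=2)
```

On this toy data the duplicate gene is as relevant as the original, but its redundancy with the first selected gene is maximal, so the second pick goes to the gene that carries new information.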
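
The mixed kernel of contribution 3 can be sketched as follows. The abstract does not give the exact form, so the sketch assumes the common convex combination K(x, y) = w * K_rbf(x, y) + (1 - w) * K_poly(x, y) with 0 <= w <= 1; the parameter names `weight`, `gamma`, `degree` and `coef0` and their default values are illustrative assumptions.

```python
from math import exp


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * squared_distance)


def poly_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel: (x . y + coef0) ** degree."""
    return (sum(a * b for a, b in zip(x, y)) + coef0) ** degree


def mixed_kernel(x, y, weight=0.7, gamma=0.5, degree=2, coef0=1.0):
    """Convex combination of the RBF and polynomial kernels.

    ASSUMPTION: the mixing form and all default values are illustrative.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * rbf_kernel(x, y, gamma) + (1.0 - weight) * poly_kernel(
        x, y, degree, coef0
    )


def gram_matrix(samples, **params):
    """Pairwise mixed-kernel values for a list of samples."""
    return [[mixed_kernel(a, b, **params) for b in samples] for a in samples]


# Toy samples (hypothetical values).
toy_samples = [[1.0, 2.0], [2.0, 0.0], [0.5, 0.5]]
K = gram_matrix(toy_samples, weight=0.7)
```

A non-negative weighted sum of two valid kernels is itself a valid kernel, so the resulting Gram matrix can be passed to any SVM implementation that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`. Parameters such as `weight`, `gamma` and `degree` are the kind of values a search procedure like the whale optimization algorithm would tune.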
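
For contribution 4, RUSBoost as it is usually described rebalances the training set in each boosting round by randomly discarding majority-class samples before fitting the weak learner. The sketch below shows only that undersampling step, not the boosting loop; the function name `random_undersample`, the fixed seed and the 1:1 target ratio are assumptions made for illustration.

```python
import random
from collections import defaultdict


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class keeps as many as the smallest class.

    Returns the retained samples and labels in their original order.
    """
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    # Size of the smallest class; every class is reduced to this size.
    target = min(len(indices) for indices in indices_by_class.values())

    kept = []
    for indices in indices_by_class.values():
        kept.extend(rng.sample(indices, target))
    kept.sort()
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Toy data (hypothetical): 8 majority-class and 2 minority-class samples.
toy_X = list(range(10))
toy_y = [0] * 8 + [1] * 2
balanced_X, balanced_y = random_undersample(toy_X, toy_y)
```

Inside a boosting loop this step would be repeated with a fresh random draw in every round, so that different majority-class samples are seen by different weak learners while the sample weights are still updated on the full training set.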
Keywords/Search Tags:Colorectal Cancer, Feature Selection, Support Vector Machine, Kernel Function, Parameter Optimization, Unbalanced Data