
Research On Cancer Classification Problem Based On Mutual Information Redundancy And Multiple Classification Models

Posted on: 2019-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: X D Liu
GTID: 2404330545973831
Subject: Software engineering
Abstract/Summary:
With cancer incidence rising and mortality remaining high, more and more researchers are turning their attention to cancer classification. Traditional cancer classification is based mainly on morphology and depends heavily on experience, so diagnostic accuracy is not high. With the advent of gene chip technology, large amounts of gene expression data have been measured, which makes early diagnosis of cancer at the gene level possible. However, gene expression data are characterized by small sample sizes, high dimensionality, and unbalanced class distribution. How to preprocess these data effectively, reduce their dimensionality through feature selection, and build a classification model with high accuracy has attracted the attention of many scholars.

For cancer classification, this thesis proposes a method based on mutual information redundancy removal and multiple classification models. First, the data are preprocessed by undersampling to prevent the data fragmentation and inappropriate inductive bias caused by class imbalance. Next, characteristic genes are selected by the information gain method, which reduces the data dimension and removes the interference of irrelevant features and their adverse effect on performance. Then, mutual information is used to remove redundant genes. Finally, the resulting feature gene set is used to build the cancer classification model. The method is applied to the classification of gene expression profiles, and data preprocessing, characteristic gene selection, redundant gene removal, and classification models are each examined through experiments and comparative studies. Predictions are made on the Kent Ridge dataset and the TCGA breast cancer dataset.

The experimental results show that the proposed method is superior to feature selection using information gain alone. In terms of classification accuracy with SVM as the classifier, selecting feature genes with mutual information redundancy removal generally outperforms selecting them with information gain only. On the five data sets, five genes were selected as characteristic genes; the largest number of redundant genes was removed on the Breast Cancer data set, where 17 genes were removed. Compared with information gain alone, classification accuracy increased by 6.7% on the Colon Cancer data set, decreased by 0.9% on the Breast Cancer data set, and improved to some degree on the other data sets. Regarding classification model construction, the study found that with this feature selection method different classifiers have different strengths on different data sets: KNN performs better on 2 data sets, while SVM performs better on 3. The results show that different combinations of feature selection algorithm and classification model perform differently, and that removing redundancy with mutual information addresses the redundancy problem in feature selection.
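The pipeline described in the abstract (undersampling, information gain ranking, then redundancy removal by mutual information) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The function names, the toy binary expression matrix, the `top_k` cutoff, and the `redundancy_threshold` rule are all assumptions made for illustration; the abstract does not state how redundancy is thresholded. The sketch relies on the fact that, for discrete variables, the information gain of a gene about the class equals the mutual information between the two.

```python
import math
import random
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def undersample(rows, labels, seed=0):
    """Randomly drop samples until every class has as many as the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    keep = sorted(i for idx in by_class.values() for i in rng.sample(idx, n_min))
    return [rows[i] for i in keep], [labels[i] for i in keep]

def select_genes(rows, labels, top_k, redundancy_threshold):
    """Rank genes by information gain, then drop genes redundant with kept ones."""
    n_genes = len(rows[0])
    columns = [[r[g] for r in rows] for g in range(n_genes)]
    # Information gain of a discrete gene about the class is I(gene; class).
    gain = [mutual_information(col, labels) for col in columns]
    ranked = sorted(range(n_genes), key=lambda g: -gain[g])[:top_k]
    kept = []
    for g in ranked:
        # Assumed rule: a gene is redundant if it shares too much information
        # with any gene that has already been kept.
        if all(mutual_information(columns[g], columns[k]) < redundancy_threshold
               for k in kept):
            kept.append(g)
    return kept

# Toy data (hypothetical): 4 tumor samples (label 1), 6 normal samples (label 0),
# 4 genes discretized to 0/1. Gene 1 is an exact copy of gene 0.
rows = [
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
] + [[0, 0, 0, 0] for _ in range(6)]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

balanced_rows, balanced_labels = undersample(rows, labels)
selected = select_genes(balanced_rows, balanced_labels,
                        top_k=3, redundancy_threshold=0.9)
print(selected)  # [0, 3]
```

In this toy run, gene 1 ranks as highly as gene 0 by information gain but is discarded because it duplicates gene 0, which is the redundancy that information gain alone does not catch. The surviving columns would then be passed to a classifier such as SVM or KNN.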
Keywords/Search Tags:Cancer classification, data preprocessing, feature selection, redundancy feature removal, information gain