
Imbalanced Data Classification And Its Application In Cancer Recognition

Posted on: 2013-07-19  Degree: Master  Type: Thesis
Country: China  Candidate: J W Zhang  Full Text: PDF
GTID: 2234330374994379  Subject: Computer application technology
Abstract/Summary:
Abstract: Machine learning is widely applied to tumor classification, but the class-imbalance problem is often ignored. A tumor dataset typically contains only a few dozen samples, while the feature dimension often reaches thousands, and the class distribution is imbalanced; gene expression data thus exhibits the characteristic features of small sample size, high dimensionality, and imbalanced distribution. Tumor classification is closely related to patients' lives: to a large extent it directly determines whether a patient receives proper treatment at the right time. Imbalanced data classification is addressed mainly from two directions: changing the distribution of the training samples, or transforming existing algorithms and proposing new ones. In this dissertation, over-sampling and extreme learning machine ensembles are studied and applied to imbalanced data classification. The main contents are summarized as follows:

(1) Cost-sensitive learning, over-sampling, and under-sampling are compared. Experiments show that, for small-sample imbalanced data, the classification performance of over-sampling is better than that of cost-sensitive learning and under-sampling.

(2) An over-sampling method based on feature selection, called FS-Sampling, is proposed. FS-Sampling assumes that features are not equally important: feature selection is used to choose the important features, and the SMOTE strategy is used to synthesize minority samples. Each synthetic minority sample keeps the key features of its seed sample and changes the others. Experiments confirm that FS-Sampling outperforms SMOTE and increases the classification accuracy of the minority class with little effect on the overall classification accuracy.

(3) An extreme learning machine ensemble method based on sample-set partition, called DS-E-ELM, is proposed. DS-E-ELM evenly divides the original training set into k disjoint subsets and combines every k-1 of them into a new training set, yielding k new training sets. A base classifier is then trained on each of the k training sets, and the final output label is decided by majority voting. Experiments show that DS-E-ELM not only increases the classification accuracy of the minority class, but also maintains low time complexity with better stability.
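The FS-Sampling idea described above can be sketched in a few lines. This is a minimal illustration, not the thesis's implementation: the function name `fs_sample`, the parameter names, and the choice of Euclidean nearest neighbours are assumptions. The key step is SMOTE-style interpolation in which the columns selected as "key features" are copied from the seed sample rather than interpolated.

```python
import numpy as np

def fs_sample(X_min, n_synthetic, key_idx, k=5, rng=None):
    """FS-Sampling sketch: SMOTE-style minority synthesis that keeps
    the selected key features of the seed sample unchanged."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        seed = X_min[i]
        # k nearest minority neighbours of the seed (excluding itself)
        dist = np.linalg.norm(X_min - seed, axis=1)
        neighbours = np.argsort(dist)[1:k + 1]
        neigh = X_min[rng.choice(neighbours)]
        # standard SMOTE interpolation between seed and neighbour
        lam = rng.random()
        synth = seed + lam * (neigh - seed)
        # FS-Sampling twist: key features stay fixed at the seed's values
        synth[key_idx] = seed[key_idx]
        synthetic.append(synth)
    return np.array(synthetic)
```

In a gene-expression setting, `key_idx` would come from a feature-selection step (e.g. a filter ranking) run before over-sampling, so the informative genes are never perturbed by interpolation.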
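The DS-E-ELM procedure (partition into k disjoint subsets, train one base learner on each leave-one-subset-out union, vote) can likewise be sketched. The tiny ELM below (random hidden layer, least-squares output weights) and all names here are illustrative assumptions, not the dissertation's code; integer class labels are assumed for the voting step.

```python
import numpy as np

class TinyELM:
    """Minimal extreme learning machine: random hidden layer,
    output weights solved by least squares."""
    def __init__(self, n_hidden=20, rng=None):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(rng)

    def fit(self, X, y):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)        # hidden activations
        self.classes_ = np.unique(y)
        T = (y[:, None] == self.classes_[None, :]).astype(float)
        self.beta = np.linalg.pinv(H) @ T       # least-squares solution
        return self

    def predict(self, X):
        H = np.tanh(X @ self.W + self.b)
        return self.classes_[np.argmax(H @ self.beta, axis=1)]

def ds_e_elm(X, y, k=5, seed=0):
    """DS-E-ELM sketch: k disjoint subsets, k leave-one-subset-out
    training sets, majority vote over the k base ELMs."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    models = []
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        models.append(TinyELM(rng=seed + i).fit(X[train], y[train]))

    def predict(Xq):
        votes = np.stack([m.predict(Xq) for m in models])
        # majority vote across the k base classifiers
        return np.array([np.bincount(col.astype(int)).argmax()
                         for col in votes.T])
    return predict
```

Because every base classifier sees k-1 of the k subsets, any two training sets overlap heavily but differ in one subset, which gives the diversity the ensemble needs while keeping each ELM cheap to train.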
Keywords/Search Tags: Tumor classification, Imbalanced data, Over-sampling, Extreme learning machine, Feature selection