Research On Clustering Ensembles Based Classification Method For Imbalanced Data Sets And Its Application

Posted on: 2019-10-22
Degree: Master
Type: Thesis
Country: China
Candidate: F Ding
Full Text: PDF
GTID: 2404330542972058
Subject: Management Science and Engineering
Abstract/Summary:
With advances in data acquisition and data storage technologies, the volume of data in many industries is growing explosively, and the types of data are becoming more diverse. Imbalanced data is widespread and appears in many fields, such as medical disease diagnosis, network intrusion-prevention systems, and text classification. In imbalanced data, the minority classes are usually the focus of attention and have high research value. Traditional classification algorithms are more concerned with the overall classification of the data: they achieve high classification accuracy on the majority classes but generally low accuracy on the minority classes. In practice, however, the minority classes are often the ones that matter most and play a key role in imbalanced data. This thesis therefore analyzes why traditional machine learning algorithms are inaccurate on minority classes. Building on the k-means clustering algorithm, it proposes the REKM algorithm, which uses clustering ensembles to reduce the degree of imbalance. Experiments on UCI datasets show that, on data processed by the REKM algorithm, the random forest classification algorithm achieves somewhat higher accuracy on the minority classes and a better overall classification result. The REKM-RF algorithm is then used to predict post-operative life expectancy in lung cancer patients. The results show that the Recall and F-measure of the REKM-RF algorithm improve by 42% and 23%, respectively, compared with the imbalanced data without any processing, and by 40% and 20%, respectively, compared with the imbalanced data processed by initial random sampling. Finally, the REKM-RF algorithm is used to analyze the factors influencing primary lung cancer; the findings could inform the prevention and treatment of lung cancer.
Keywords/Search Tags: Imbalanced data, Classification, Clustering ensembles, K-means clustering, Primary lung cancer
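
The abstract reports its results as Recall and F-measure rather than overall accuracy. The minimal sketch below illustrates why those metrics are the usual choice for imbalanced data. It is not the REKM or REKM-RF algorithm, whose details the abstract does not give. The function names and the confusion-matrix counts are hypothetical, chosen only for illustration, and do not come from the thesis.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all samples classified correctly (dominated by the majority class)."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Share of predicted-minority samples that really are minority."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual minority samples that the classifier found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    p, r = precision(tp, fp), recall(tp, fn)
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


# Hypothetical counts (NOT from the thesis): 1000 samples, 50 in the minority
# class, and a classifier that labels almost everything as majority.
tp, fp, fn, tn = 5, 5, 45, 945

acc = accuracy(tp, fp, fn, tn)
rec = recall(tp, fn)
f1 = f_measure(tp, fp, fn)

print(f"accuracy={acc:.3f} recall={rec:.3f} f_measure={f1:.3f}")
```

With these illustrative counts the accuracy is 0.95, while the minority-class recall is only 0.10 and the F-measure is about 0.17, so a classifier can look strong overall while missing most of the minority class.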