
Selectively Mining Approach With Dynamical Chunk Size For Imbalanced Data Streams In Nonstationary Environment

Posted on: 2018-01-21
Degree: Master
Type: Thesis
Country: China
Candidate: N N Liu
Full Text: PDF
GTID: 2348330542960095
Subject: Computer Science and Technology
Abstract/Summary:
Compared with traditional static data, data streams are real-time, massive, and dynamic, and can be scanned only once. In recent years, more and more algorithms have been proposed for data stream classification, but most of them assume that the class distribution is balanced or nearly balanced. In many real-life applications, however, such as fault diagnosis in monitoring systems, network intrusion detection, credit card fraud detection, telecommunications management, oil spill detection, and text classification, the class distribution is imbalanced, and misclassifying the minority class often causes great loss. Improving classification precision on minority instances without reducing accuracy on majority instances is therefore a hot and difficult issue in mining imbalanced data streams. Concept drift is another difficult problem in data stream classification, and the combination of concept drift and class imbalance makes classification even more challenging. At present, most proposed ensemble classification algorithms are based on data chunks, which, like sliding windows, make performance highly sensitive to the chunk size. Moreover, they generally assume that no drift occurs within a data chunk, which is not consistent with real data streams.

This thesis proposes a selective mining approach with dynamic chunk size for imbalanced data streams in nonstationary environments. It consists of two algorithms:

(1) The SMDC algorithm. A concept drift detector is added to adjust the size of the current chunk so that all instances in the chunk belong to the same concept, which improves the ability of the classifiers. For the drift detector, this thesis proposes a detection method designed for imbalanced data streams that does not rely on overall accuracy. It detects concept drift in both majority and minority instances and is not affected by a certain amount of noise. In addition, following an idea from large-scale data processing, some minority instances are selectively retained and the majority instances are under-sampled without repetition, which prevents the number of minority instances from exceeding the number of majority instances. In this way the classifiers are trained well and classification accuracy improves. Experiments comparing the algorithm with other typical algorithms on different datasets show that it achieves higher classification accuracy on imbalanced data streams and is robust to frequent and fast drifts.

(2) The SMDCWE algorithm. To avoid forgetting important knowledge from old instances and to improve the algorithm's adaptability to different types of concept drift, a weighting mechanism is added, under which previously learned classifiers are retained and combined by voting. Experiments on synthetic datasets and a real dataset show that the algorithm achieves higher classification accuracy on imbalanced data streams and is more sensitive to concept drift.
Keywords/Search Tags: imbalanced data streams, concept drift, dynamic chunk, ensemble classifiers, under-sampling