
Research On Statistical Sampling Method Of Stream Data For Concept Drift

Posted on: 2021-02-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Lin
Full Text: PDF
GTID: 1480306500466574
Subject: Computer Science and Technology
Abstract/Summary:
With the development of the Internet and modern industry, many kinds of data streams have emerged, such as web clickstreams, sensor network streams, e-commerce transaction streams, and navigation data streams. These streams are large in volume, high-dimensional, and generated in real time, so they cannot be held in memory all at once and can only be read in a single pass. Data streams therefore pose a new challenge to the computational efficiency of database storage and data mining methods. In practice, approximation algorithms are commonly used to build a summary data structure over the original stream, which reduces the effect of data volume on the problem being solved. The quality of the summary directly affects the results of database queries and data mining, so building a high-quality summary data structure is an important topic in data stream research. A good summary should represent the distribution of the full data, meaning that computations on the summary give almost the same results as computations on the full data. Sampling is the main method for generating summaries, and because sampled summaries are general-purpose, sampling has become an active research topic in recent years.

Most existing sampling methods rarely consider concept drift, the high dimensionality that changes the overall sample distribution, or the high cost of computation and storage. To address these problems, this thesis studies the representation of summary data structures under concept drift and high-dimensional data, and focuses on statistical sampling under abrupt (mutation), gradual, and outlier concept drift. Its main research contents and contributions are as follows:

1. Most existing sampling methods consider only steady-state data and ignore concept drift, so a concept-drift-based multi-dimensional data stream sampling method (CDMDSS) is proposed.
(1) Unit-division sampling strategy. The data in the reference window are divided into units to obtain frequency statistics for each dimension, and the data are sampled in proportion to the frequency of the data in the sliding window.
(2) Adaptive strategy. The sliding window is used to detect whether concept drift has occurred. If the data distribution has not changed, the previously sampled data are retained as the summary. Otherwise, the current sliding window becomes the new reference window so that the new data distribution is captured.
Compared with simple random sampling and sampling methods based on the Gaussian distribution, the proposed method captures the real-time distribution of the data, and the generated summary is closer to the overall data distribution. The proposed method is also suitable for both discrete and continuous data.

2. The statistical sampling problem under gradual concept drift is studied. A hybrid statistical sampling algorithm based on dynamic feature selection and feature retention is proposed to address the high computational complexity and high storage cost of high-dimensional data streams.
(1) Dynamic selection of important features. As the sliding window changes, the important features are selected using a low-rank approximation of the data matrix.
(2) Important-feature consistency strategy. Under the constraint that the distribution represented by the important features is consistent with the distribution represented by the original features, the selected feature subset is kept unchanged within the current window, and high-quality samples are drawn as the summary.
The proposed method computes frequency statistics only on the important features, which eliminates redundant computation and improves sampling efficiency. When concept drift occurs, the method can still obtain high-quality representative samples. In addition, it uses a first-in, first-out queue to bound memory usage over time.

3. The statistical sampling of data streams driven by an anomaly detection task is studied. Outliers make up a small proportion of the total samples, but they follow a certain distribution, which can be regarded as a form of concept drift. Because outliers are difficult to capture in data streams driven by an anomaly detection task, a summary generation method that emphasizes outlier sampling is proposed so that anomalies can be detected in time and handled accordingly. The proposed method first divides the imbalanced dataset into high-density and low-density clusters; data in high-density clusters are randomly sampled, while data in low-density clusters are sampled using a relative distance measure. Second, when the space allocated for the summary is used up, newly arriving data are processed with a high-density-cluster-first substitution strategy. Finally, experimental results on real datasets show that the proposed method outperforms several classical anomaly detection algorithms in terms of how well the summary represents outliers.
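The unit-division sampling and the adaptive window switch described in contribution 1 can be illustrated with a minimal Python sketch. This is not the thesis's CDMDSS implementation. It assumes equal-width bins per dimension as the "units", joint bin cells as the strata, and a total variation distance with a fixed threshold as the drift test; the abstract does not specify any of these, and all function and parameter names (`cell_of`, `proportional_sample`, `summarize_stream`, `bins`, `budget`, `threshold`) are hypothetical.

```python
import random
from collections import Counter, defaultdict


def cell_of(point, lo, hi, bins):
    """Map a point to a tuple of per-dimension bin indices (its 'unit')."""
    idx = []
    for x, a, b in zip(point, lo, hi):
        k = 0 if b == a else int((x - a) / (b - a) * bins)
        idx.append(min(max(k, 0), bins - 1))  # clip to the valid bin range
    return tuple(idx)


def histogram(window, lo, hi, bins):
    """Relative frequency of each occupied unit in a window."""
    counts = Counter(cell_of(p, lo, hi, bins) for p in window)
    n = len(window)
    return {c: v / n for c, v in counts.items()}


def total_variation(h1, h2):
    """Total variation distance between two unit histograms (assumed drift score)."""
    keys = set(h1) | set(h2)
    return 0.5 * sum(abs(h1.get(k, 0.0) - h2.get(k, 0.0)) for k in keys)


def proportional_sample(window, lo, hi, bins, budget, rng):
    """Draw roughly `budget` points, allocating draws per unit by its frequency."""
    groups = defaultdict(list)
    for p in window:
        groups[cell_of(p, lo, hi, bins)].append(p)
    n = len(window)
    summary = []
    for pts in groups.values():
        # At least one draw per occupied unit, never more than the unit holds.
        k = min(len(pts), max(1, round(budget * len(pts) / n)))
        summary.extend(rng.sample(pts, k))
    return summary


def summarize_stream(windows, lo, hi, bins=4, budget=20, threshold=0.3, seed=0):
    """Keep the summary while the distribution is stable; resample on drift."""
    rng = random.Random(seed)
    it = iter(windows)
    reference = next(it)
    ref_hist = histogram(reference, lo, hi, bins)
    summary = proportional_sample(reference, lo, hi, bins, budget, rng)
    drift_at = []
    for i, window in enumerate(it, start=1):
        hist = histogram(window, lo, hi, bins)
        if total_variation(ref_hist, hist) > threshold:
            # Drift: the current sliding window becomes the reference window.
            reference, ref_hist = window, hist
            summary = proportional_sample(reference, lo, hi, bins, budget, rng)
            drift_at.append(i)
    return summary, drift_at


# Synthetic demo: two windows from one region, then a window from another.
gen = random.Random(1)
low = [[(gen.random() * 0.5, gen.random() * 0.5) for _ in range(200)] for _ in range(2)]
high = [(0.5 + gen.random() * 0.5, 0.5 + gen.random() * 0.5) for _ in range(200)]
summary, drift_at = summarize_stream(low + [high], lo=(0.0, 0.0), hi=(1.0, 1.0))
print("drift detected at window indices:", drift_at)
print("summary size:", len(summary))
```

In this sketch the summary is left untouched while the sliding window's histogram stays close to the reference histogram, and it is rebuilt only when the distance exceeds the threshold. The threshold and bin count are illustrative values that would need tuning for real data.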
Keywords/Search Tags: Data Stream Mining, Summary, Statistical Sampling, Concept Drift, Density Estimation