Research On Semi-Supervised Classification Algorithm For Concept Drift Data Streams Based On Model Reuse

Posted on: 2024-05-12
Degree: Master
Type: Thesis
Country: China
Candidate: W Kang
GTID: 2568307157982299
Subject: Computer Science and Technology
Abstract/Summary:
With the development of technology, real computing devices generate large amounts of data at high speed and without end in the form of data streams, and processing this data has become a major challenge. In practical data stream classification, factors such as annotation cost, data volume, and storage limits mean that only a few labeled samples are usually available for learning, while most samples are unlabeled. In addition, the data distribution in a stream often changes over time, a phenomenon called concept drift, which traditional machine learning algorithms adapt to poorly. The main challenges in a semi-supervised data stream setting with concept drift are therefore: (1) how to mine the information in both the unlabeled and the labeled samples of a data chunk to train a robust semi-supervised classification model; (2) existing classifier update strategies cannot promptly retain classifiers that belong to different concepts while also keeping as many classifiers as possible that are consistent with the current concept, and incrementally updating the component classifiers in the classifier pool can leave a model that mixes several concepts, which hurts robustness; (3) classifiers are repeatedly reinitialized whenever concept drift is detected, and detecting recurring concepts under the clustering assumption is very time-consuming.

The research in this thesis covers the following two aspects. First, a new semi-supervised classification algorithm for data streams based on clustering-model reuse is proposed. The data stream arrives as data chunks, and after a data chunk has been classified, a clustering model that adaptively determines the number of clusters is trained. Next, the similarity between each component classifier in the classifier pool and the clustering model is calculated, and multiple component classifiers are selected. The selected component classifiers are then reused with the current data chunk and integrated with the clustering model. After that, the classifier pool is split into two pools for updating: an old-new replacement pool and a diversity-maximization pool. Finally, the integrated model classifies the samples of the next data chunk. Experimental results on multiple artificial and real datasets show that the algorithm adapts effectively to concept drift, with significant improvements over existing methods.

Second, a semi-supervised classification algorithm for data streams with recurring concept drift, based on model reuse, is proposed. The labeled samples in the data chunk are first used to initialize the classification model. When concept drift is detected as the stream is processed, the model and its conformal prediction output for the corresponding unlabeled samples are added to the classifier pool, and a new model is built. The component classifiers in the pool are then examined to find one whose conformal prediction output is similar to that of the current data chunk, and recurring concepts are detected with a distribution-based method. Finally, the classifier pool is updated and the model is incrementally updated according to the drift detection results. The algorithm was tested on multiple synthetic and real datasets, and its cumulative accuracy and chunk accuracy under different labeling ratios reflect its effectiveness.

The innovations of this thesis are: (1) A concept management method with dual classifier pools is proposed. Multiple similar component classifiers are dynamically selected from the pools, and the selected component classifiers are reused with the clustering model on the current data chunk, which adapts effectively to concept drift and improves classification performance. (2) Conformal prediction is incorporated to detect recurring concept drift. Classifiers trained on data from different concepts are saved in the classifier pool. After concept drift is detected, the conformal prediction output of each component classifier in the pool is compared with the conformal prediction output of the data chunk to detect recurring concepts, and the model and classifier pool are then updated.
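
The recurrence check described in the abstract (comparing the conformal prediction output stored with each pooled classifier against the output obtained on the current data chunk) could look roughly like the sketch below. It is not the thesis's implementation. The nearest-centroid classifier, the distance-based nonconformity score, the two-sample Kolmogorov-Smirnov statistic standing in for the "distribution-based method", the 0.35 threshold, the synthetic Gaussian concepts, and every function name are assumptions made purely for illustration.

```python
import bisect
import random


def fit_centroids(X, y):
    """Nearest-centroid model, a stand-in for a component classifier (assumption)."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {c: [sum(col) / len(pts) for col in zip(*pts)] for c, pts in groups.items()}


def score(centroids, x, label):
    """Nonconformity score: Euclidean distance from x to the centroid of `label`."""
    return sum((a - b) ** 2 for a, b in zip(x, centroids[label])) ** 0.5


def conformal_output(model, X):
    """For each unlabeled sample, the largest conformal p-value over candidate labels."""
    centroids, calib = model["centroids"], model["calib"]  # calib is sorted
    n = len(calib)
    out = []
    for x in X:
        p_best = 0.0
        for label in centroids:
            s = score(centroids, x, label)
            n_ge = n - bisect.bisect_left(calib, s)  # calibration scores >= s
            p_best = max(p_best, (n_ge + 1) / (n + 1))
        out.append(p_best)
    return out


def build_model(X_lab, y_lab, X_unlab):
    """Train on half of the labeled samples, calibrate on the other half, and
    store the conformal output on the chunk's unlabeled samples as a reference."""
    half = len(X_lab) // 2
    centroids = fit_centroids(X_lab[:half], y_lab[:half])
    calib = sorted(score(centroids, x, y) for x, y in zip(X_lab[half:], y_lab[half:]))
    model = {"centroids": centroids, "calib": calib}
    model["reference"] = conformal_output(model, X_unlab)
    return model


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect.bisect_right(a, v) / len(a) - bisect.bisect_right(b, v) / len(b))
        for v in a + b
    )


def find_recurring(pool, X_chunk, threshold=0.35):
    """Index of the pooled model whose stored conformal output best matches the
    chunk's conformal output, or None if no gap is below the (assumed) threshold."""
    best_idx, best_gap = None, threshold
    for idx, model in enumerate(pool):
        gap = ks_statistic(model["reference"], conformal_output(model, X_chunk))
        if gap < best_gap:
            best_idx, best_gap = idx, gap
    return best_idx


def make_chunk(rng, centers, n):
    """Synthetic 2-D chunk: one Gaussian blob per class, labels alternate."""
    X, y = [], []
    for i in range(n):
        label = i % len(centers)
        cx, cy = centers[label]
        X.append((rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)))
        y.append(label)
    return X, y


# Two hypothetical concepts; each pooled model sees 60 labeled and 100 unlabeled samples.
CONCEPT_A = [(0.0, 0.0), (3.0, 3.0)]
CONCEPT_B = [(0.0, 6.0), (6.0, 0.0)]
rng = random.Random(0)
pool = []
for centers in (CONCEPT_A, CONCEPT_B):
    X_lab, y_lab = make_chunk(rng, centers, 60)
    X_unlab, _ = make_chunk(rng, centers, 100)
    pool.append(build_model(X_lab, y_lab, X_unlab))

X_new, _ = make_chunk(rng, CONCEPT_A, 100)  # concept A comes back
print("matched pool index:", find_recurring(pool, X_new))
```

In this sketch a chunk drawn from a previously seen concept yields p-values distributed like the stored reference, so the gap stays small and the old model can be reused. A chunk from an unseen concept yields uniformly low p-values under every pooled model, `find_recurring` returns `None`, and a new model would be trained instead.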
Keywords/Search Tags: data stream, semi-supervised learning, ensemble learning, model reuse