| In today’s era of big data, data is generated at high speed and without bound, and processing and storing data at this scale has become a major challenge. Because data is produced continuously and rapidly, human experts cannot label all samples quickly and accurately; manual labeling is expensive and impractical. In real data stream environments, therefore, there are often only a small number of labeled samples and a large number of unlabeled samples. A model trained only on the labeled samples generalizes poorly and wastes the rich structural information in the unlabeled samples. At the same time, the data distribution in a stream often changes, a phenomenon known as concept drift. Traditional machine learning algorithms assume a static, independent and identically distributed environment, so they cannot handle a high-speed, unbounded data stream with concept drift. There are two main challenges in the dynamic semi-supervised data stream environment: (1) Given the characteristics of data streams in a semi-supervised environment, how to train a component classifier with strong generalization ability by combining the label information of a small number of labeled samples with the structural information of a large number of unlabeled samples, thereby addressing the underfitting caused by building a component classifier from labeled samples alone, and how to make the constructed component classifier retain the internal structure and distribution information of the current data. (2) In a semi-supervised data stream with concept drift, unsupervised concept drift detection and statistical detection both have limitations; the question is how to design an appropriate concept drift detection mechanism that detects concept drift quickly and accurately and builds or updates component classifiers in time to adapt to changes in the concept distribution, so as to 
make the classification accuracy recover rapidly. In view of the research value and the challenges of semi-supervised classification of dynamic data streams, the research content of this paper is summarized as follows. First, this paper proposes an algorithm, SCLNDT. The algorithm first uses labeled samples to build a decision tree model that divides the sample space into multiple regions, and clusters the labeled samples in each leaf node by class. The cluster information in each leaf node of the component classifier is then updated incrementally with unlabeled samples, and the sample information in each cluster is represented by CF features. The resulting component classifier generalizes well, has a simple structure, and retains the internal structure and distribution information of the data. A semi-supervised learning algorithm is also designed to expand the set of labeled samples and thereby assist concept drift detection. Because historical concepts may reappear in the future, the algorithm also detects recurring concepts. Extensive experimental results show the effectiveness of the algorithm. The main innovation of the SCLNDT algorithm is its construction strategy for semi-supervised component classifiers. To alleviate the underfitting of a model trained only on labeled samples, the algorithm uses unlabeled samples to incrementally update the cluster information in the component classifier. For concept drift detection in a semi-supervised environment, this paper proposes using the knowledge in the effective clusters of historical models, together with the information in the labeled samples, to design a semi-supervised learning algorithm that expands the labeled samples to assist concept drift detection. Second, for the semi-supervised data stream environment, this paper proposes a semi-supervised data stream classification algorithm, TLSCDT. TLSCDT proposes a semi-supervised learning algorithm to 
expand the number of labeled samples. This semi-supervised learning algorithm computes a threshold with a neighborhood threshold adaptive algorithm, uses a weight formula to compute the weight of each historical model, and selects historical classifiers according to these weights. It then uses the structure of each selected model and the labeled samples to label each unlabeled sample from the labeled samples in its nearest neighborhood, and finally assigns the unlabeled sample a pseudo-label by majority voting. When the ensemble pool is full, the algorithm eliminates component classifiers in a way that maintains maximum diversity among the component classifiers remaining in the pool. The update strategy for historical models is based on model transfer learning. Extensive experimental results show the effectiveness of the algorithm. The main innovation of the TLSCDT algorithm is a semi-supervised learning algorithm with an adaptive neighborhood threshold for expanding the number of labeled samples. Across different datasets and data stream environments with changing data distributions, the algorithm obtains the neighborhood threshold adaptively in real time, meeting the real-time requirements of semi-supervised data streams. |
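
As an illustration of the incremental cluster update described for SCLNDT, the following is a minimal sketch and not the paper's implementation. It assumes that "CF feature" refers to a BIRCH-style clustering feature (count, linear sum, sum of squares), and that an unlabeled sample is absorbed by the cluster with the nearest centroid inside one leaf node; the names `ClusterFeature` and `absorb_unlabeled` are hypothetical, and the abstract does not specify the actual assignment rule.

```python
import math


class ClusterFeature:
    """Hypothetical BIRCH-style summary (N, LS, SS) of one cluster in a leaf node."""

    def __init__(self, dim):
        self.n = 0                 # number of samples absorbed so far
        self.ls = [0.0] * dim      # per-dimension linear sum of the samples
        self.ss = 0.0              # sum of squared norms of the samples

    def add(self, x):
        """Absorb one sample without storing it (constant memory per cluster)."""
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        """Root mean squared distance of the absorbed samples to the centroid."""
        c = self.centroid()
        variance = self.ss / self.n - sum(v * v for v in c)
        return math.sqrt(max(variance, 0.0))


def absorb_unlabeled(clusters, x):
    """Assumed rule: update the CF of the cluster whose centroid is nearest to x.

    `clusters` maps a class label to the ClusterFeature built from the labeled
    samples of that class in one leaf node. Returns the chosen label.
    """
    label = min(clusters, key=lambda k: math.dist(clusters[k].centroid(), x))
    clusters[label].add(x)
    return label
```

Because only the three summary values are kept per cluster, each unlabeled sample can be discarded after the update, which suits an unbounded stream.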