Research On Stream Data Classification Algorithm Mining Based On Spark

Posted on:2019-12-30

Degree:Master

Type:Thesis

Country:China

Candidate:R Zhuang

Full Text:PDF

GTID:2428330566995991

Subject:Computer software and theory

Abstract/Summary:


With the rapid development of the Internet, stream data arrives in real time, continuously, and without bound, and traditional data mining algorithms can no longer meet the resulting mining requirements. Stream data mining has therefore become an active research topic. This thesis focuses on classification mining algorithms for stream data. To improve their efficiency and performance, it both improves existing classification algorithms and deploys the improved algorithms on the big data processing platform Spark for parallel execution.

To improve the efficiency of stream data classification, the CVFDT (Concept-adapting Very Fast Decision Tree) algorithm is parallelized across attributes, and a Spark-based parallel implementation scheme for CVFDT is designed around Spark's stream computing mechanism. Experiments with this scheme on Spark show that CVFDT classifies significantly more efficiently on a Spark cluster than in a stand-alone environment, and that the parallel CVFDT algorithm adapts well to large-scale stream data processing.

To improve CVFDT's handling of stationary data streams with continuous attributes, two improvements are made: the multiple delta method replaces the Hoeffding bound computation, and a more efficient and accurate method for splitting continuous attributes and recalculating weights is designed. The resulting algorithm for continuous attributes is named C-CVFDT. A Spark-based parallel implementation scheme for C-CVFDT is also designed. Implementation and test experiments based on this scheme show that C-CVFDT achieves better prediction accuracy and classification efficiency on stream data containing continuous attributes.

Because CVFDT cannot handle non-stationary data effectively, a concept-adaptive ensemble classification algorithm for non-stationary stream data, named ECA (Ensemble Classification Algorithm), is designed by combining CVFDT base classifiers with Naive Bayes base classifiers. The core idea of ECA is to learn an ensemble classifier using both the CVFDT and Naive Bayes classification methods and then, when concept drift causes the classifier's accuracy to drop to a set threshold, to update the ensemble with new base classifiers so that it adapts to the new stream data. Experimental results show that ECA adapts well to concept drift.

The stream data classification algorithms and their Spark-based parallel implementation schemes suit the unbounded, fast, and real-time characteristics of stream data. The research content of this thesis is relatively advanced, and the results have some theoretical value and good practical applicability.
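
For context, the Hoeffding bound that the abstract names as the baseline split criterion in CVFDT can be sketched as below. This is a minimal illustration of the standard bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), and not the thesis's implementation or its multiple delta replacement. The function names, the `tie_threshold` parameter, and the sample numbers are assumptions made for the example.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    value_range (R) is the range of the split measure; for information
    gain with c classes it is log2(c). delta is the allowed probability
    of choosing the wrong attribute; n is the number of examples seen
    at the leaf.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float, value_range: float,
                 delta: float, n: int, tie_threshold: float = 0.05) -> bool:
    """Decide whether a leaf has seen enough examples to split.

    Split when the best attribute beats the runner-up by more than
    epsilon, or when epsilon has shrunk below tie_threshold (a
    hypothetical default), meaning the two attributes are close enough
    that waiting longer is not worthwhile.
    """
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > eps or eps < tie_threshold


if __name__ == "__main__":
    # Two-class problem, so R = log2(2) = 1.0; the numbers are illustrative.
    print(hoeffding_bound(1.0, 1e-7, 1000))
    print(should_split(0.30, 0.10, 1.0, 1e-7, 1000))
```

Because epsilon shrinks as n grows, a leaf that cannot yet separate two close attributes will eventually split through the tie rule once enough examples have arrived.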

Keywords/Search Tags:

stream data, classification, CVFDT, ensemble classification


Related items

1	Research On Data Stream Classification Algorithm Based On Ensemble Learning
2	Research On The Classification Methods For Dynamic Data Stream
3	Research On Data Stream Classification Algorithm With Limited Amount Of Labeled Data
4	Research On Single And Multi Label Data Stream Classification Based On Ensemble Category
5	Research On Subspace Ensemble Learning
6	Research On Data Stream Classification Algorithm Based On Ensemble Learning
7	Technical Data Stream Flow Recognition Mining
8	Research On Hybrid Ensemble Model Based Data Stream Classification With Unlabeled Data
9	The Research On Massive And Dynamic Data Stream Classification Method
10	Research On Concept Drift Data Stream Classification Based On Ensemble Learning