
Research On Stream Data Classification Algorithm Mining Based On Spark

Posted on: 2019-12-30
Degree: Master
Type: Thesis
Country: China
Candidate: R Zhuang
Full Text: PDF
GTID: 2428330566995991
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, traditional data mining algorithms can no longer meet the demands of mining stream data, which arrives in real time, continuously, and without bound. Stream data mining has therefore become an active research topic. This thesis focuses on classification mining algorithms for stream data. To improve their efficiency and performance, it both improves existing classification algorithms and deploys the improved algorithms on the big data processing platform Spark for parallel execution.

To improve the efficiency of stream data classification, the CVFDT (Concept-adapting Very Fast Decision Tree) algorithm is parallelized across attributes, and a Spark-based parallel implementation scheme for CVFDT is designed around Spark's stream computing mechanism. Results from implementing the scheme on Spark show that the classification efficiency of CVFDT in a Spark cluster environment is significantly higher than in a stand-alone environment. The improved parallel CVFDT algorithm adapts well to large-scale stream data processing.

To improve the ability of CVFDT to handle stable data streams with continuous attributes, two changes are made to the algorithm: the Hoeffding bound computation is replaced with the multiple Delta method, and a more efficient and accurate method for splitting continuous attributes and recalculating weights is designed. The result is a CVFDT variant for continuous attributes, named C-CVFDT. A Spark-based parallel implementation scheme for C-CVFDT is also designed. The implementation and test experiments based on this scheme show that C-CVFDT achieves better prediction accuracy and classification efficiency on stream data containing continuous attributes.

To address the problem that CVFDT cannot handle unstable data effectively, a concept-adaptive ensemble classification algorithm for unstable stream data is designed. Following the ensemble classifier approach, it combines base classifiers built with CVFDT and base classifiers built with Naive Bayes, and is named ECA (Ensemble Classification Algorithm). The core idea of ECA is to train an ensemble classifier using both the CVFDT and the Naive Bayes classification methods. When concept drift causes the classifier's accuracy to drop to a set threshold, new base classifiers are used to update the ensemble so that it adapts to the new stream data. The experimental results show that ECA adapts strongly to concept drift.

The stream data classification algorithms and their Spark-based parallel implementation schemes suit the characteristics of stream data: it is unbounded, fast-arriving, and real-time. The research content of this thesis is relatively advanced, and the results have theoretical value and good practicability.
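The abstract refers to the Hoeffding bound that CVFDT uses when deciding whether to split a node. As a rough illustration only, the standard form of that bound and the split test built on it can be sketched as below. This is not the thesis's code: the function names, the `tie_threshold` default, and the sample numbers are assumptions chosen for the example, and the thesis's C-CVFDT replaces this bound with a different method.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Standard Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    value_range (R): range of the split metric (e.g. 1.0 for information
                     gain on a two-class problem).
    delta:           allowed probability that the bound fails.
    n:               number of examples seen at the node.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float, value_range: float,
                 delta: float, n: int, tie_threshold: float = 0.05) -> bool:
    """Split when the best attribute beats the runner-up by more than epsilon,
    or when epsilon is so small that the two are effectively tied.

    tie_threshold is a hypothetical default picked for this sketch.
    """
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > eps or eps < tie_threshold


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the thesis.
    print(hoeffding_bound(1.0, 1e-7, 1000))
    print(should_split(0.30, 0.10, 1.0, 1e-7, 1000))
```

Because the bound shrinks as more examples reach a node, the tree can commit to a split after seeing only a sample of the stream, which is what makes this family of algorithms suitable for unbounded data.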
Keywords/Search Tags: stream data, classification, CVFDT, ensemble classification