Parallel Machine Learning Algorithm For Large-scale Forestry Text Classification Based On Spark

Posted on: 2020-06-05
Degree: Master
Type: Thesis
Country: China
Candidate: D Y Shi
GTID: 2393330575497725
Subject: Computer software and theory
Abstract/Summary:
With the gradual integration of new information technology into forestry, forestry-related texts have become large in scale and difficult to integrate. However, existing research on forestry text classification has not kept pace with the current requirements of the field. The shortcomings are mainly manifested in two aspects: 1) the classification labels in existing classification systems are not scientifically defined, and the classification algorithms are mostly trained on small batches of data, which leads to classification models with poor practical applicability; 2) the classification algorithms mostly run in stand-alone environments and cannot handle real large-scale classification scenarios. This thesis combines big data analysis technology with forestry text analysis and establishes new classification labels. Feature weights are then computed using TF-IDF and Word2vec. After that, a parallel XGBoost algorithm is implemented on the Spark computing framework and compared with three other parallel machine learning algorithms. The results show that: 1) the combination of XGBoost and TF-IDF classified significantly better than the other seven parallel systems; 2) every algorithm was more efficient and more accurate with TF-IDF than with Word2vec, which indicates that the feature vectors obtained with TF-IDF from Internet text are more representative of forestry characteristics; 3) running on Spark, the XGBoost algorithm was more efficient than its stand-alone version and could therefore handle the classification of massive forestry texts.
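The TF-IDF feature weighting mentioned in the abstract can be illustrated with a minimal single-machine sketch. This is not the thesis's Spark implementation. The function name `tf_idf`, the English toy tokens, and the smoothed IDF formula log((N + 1) / (df + 1)) are assumptions chosen for illustration; real input would be word-segmented Chinese forestry text.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    # Smoothed IDF (an assumption of this sketch): log((N + 1) / (df + 1)).
    # A term that occurs in every document gets weight 0.
    idf = {term: math.log((n_docs + 1) / (count + 1)) for term, count in df.items()}
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term counts within this document
        weights.append({term: freq * idf[term] for term, freq in tf.items()})
    return weights

# Hypothetical toy corpus of already-tokenized documents.
docs = [
    ["forest", "fire", "prevention"],
    ["forest", "pest", "control"],
    ["forest", "fire", "fire", "monitoring"],
]
weights = tf_idf(docs)
```

In this toy corpus, "forest" appears in every document and so carries no weight, while rarer terms such as "pest" receive higher weights. This is the property that lets TF-IDF vectors emphasize category-specific vocabulary.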
Keywords/Search Tags: Spark, Chinese text classification, forestry text, machine learning, XGBoost