The rapid growth of the Internet has made it an inexhaustible source of information. Faced with such massive resources, how to automatically, quickly, and accurately extract the information users need from huge volumes of text has become one of the research focuses in natural language processing. The rapid development of cloud computing provides favorable conditions for the distributed storage and mining of massive Chinese text data. In this work, storage is provided by the HDFS file system, whose high throughput, good fault tolerance, and related characteristics match the requirements of big-data mining. Spark is selected as the platform for data analysis and processing because it retains the advantages of Apache Hadoop's MapReduce model while providing its own memory-based computing engine, which is considerably more efficient for iterative computation and for common machine learning algorithms.

In this study, Naive Bayes (NB) and Logistic Regression (LR) are applied to parallel Chinese text classification. The NB algorithm is optimized into an improved text classification model, TNBIF, tailored to the characteristics of massive text data, and the model is implemented in parallel on the Spark platform. The main work is as follows.

First, a series of preprocessing steps is performed for the characteristics of massive text data. Noise is removed from the text, sentences are segmented, and word segmentation and part-of-speech tagging are carried out. The data are then cleaned and filtered so that only words with the specified parts of speech (nouns, verbs, and adjectives) are retained; these candidate keywords form a candidate keyword graph (a minimal sketch of this step follows below). The TextRank algorithm is then applied to extract Chinese keywords, which reduces the feature dimensionality (see the second sketch below).

Second, the calculation of the posterior probability of Chinese feature words in the Naive Bayes classifier is improved. Ordinary NB does not consider what proportion of the documents in the whole data set contain a given feature word, so this experiment introduces a weight value, or influence coefficient, to account for that document frequency (a hedged reconstruction of the idea appears below).

Third, the improved TNBIF model is implemented and optimized under the distributed framework of a Spark cluster, and parallelized versions of the NB and LR algorithms are designed and implemented for large-scale Chinese text classification.

In the experiments, the Fudan University corpus and the Sogou corpus are selected. In the preprocessing stage, TextRank is applied to both corpora to extract Chinese keywords and reduce dimensionality, with the sequential preprocessing carried out by an internal Python program. Word weighting, namely TF-IDF, is then computed in parallel, and classification is performed in Spark with NB and LR (a pipeline sketch follows). With NB and LR, accuracy on the Chinese text corpora reaches 93%, an average increase of 4.03% on the same corpora.
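The abstract states only that an internal Python program performs the sequential preprocessing; as a minimal sketch, assuming the widely used jieba library, the segmentation, part-of-speech tagging, and filtering step could look like the following (the function name and sample sentence are illustrative):

```python
# Minimal preprocessing sketch (assumes the jieba library; the paper only
# says an internal Python program was used). Segments text, tags parts of
# speech, and keeps nouns ("n*"), verbs ("v*"), and adjectives ("a*") as
# candidate keywords.
import jieba.posseg as pseg

def candidate_keywords(text, allowed_prefixes=("n", "v", "a")):
    """Return the words whose POS tag starts with an allowed prefix."""
    return [word for word, flag in pseg.cut(text)
            if flag.startswith(allowed_prefixes)]

print(candidate_keywords("云计算为海量中文文本的分布式存储带来了便利"))
```

A sketch of the TextRank keyword-extraction step follows, here using the TextRank implementation bundled with jieba as a stand-in (the paper builds its own candidate keyword graph, so this is an assumption). Restricting allowPOS to nouns, verbs, and adjectives mirrors the filtering above; topK is an illustrative value:

```python
# TextRank keyword extraction for dimensionality reduction (sketch; uses
# jieba's bundled implementation, and the topK value is illustrative).
import jieba.analyse

text = "朴素贝叶斯算法结合TextRank关键词提取可以降低中文文本特征的维度"
keywords = jieba.analyse.textrank(text, topK=10, withWeight=True,
                                  allowPOS=("n", "v", "a"))
for word, weight in keywords:
    print(word, weight)
```

The abstract does not state the modified posterior formula, so the following is only a hedged reconstruction of the idea under stated assumptions: standard multinomial NB estimates the class-conditional probability of a feature word without regard to how many documents in the whole data set contain it, and TNBIF is described as correcting this with a document-frequency weight governed by an influence coefficient, written here as \alpha:

```latex
% Hedged reconstruction, not the paper's exact formula.
% Standard multinomial NB with Laplace smoothing, where N(w_i, c_j) is the
% count of word w_i in class c_j and |V| is the vocabulary size:
P(w_i \mid c_j) = \frac{N(w_i, c_j) + 1}{\sum_{k} N(w_k, c_j) + |V|}

% Assumed TNBIF-style weighting: df(w_i) is the number of documents in the
% whole data set containing w_i, |D| is the total number of documents, and
% \alpha is the influence coefficient tuned in the experiments.
\hat{P}(w_i \mid c_j) \propto \left( \alpha \cdot \frac{\mathrm{df}(w_i)}{|D|} \right) P(w_i \mid c_j)
```

Finally, an end-to-end sketch of the parallel stage on Spark: TF-IDF weighting followed by NB and LR classification with Spark ML. The toy data, column names, and hyperparameters are illustrative assumptions, and the TNBIF modification itself is not part of Spark ML, so it is omitted here; in the real experiments the corpora are split into training and test sets rather than reused as below:

```python
# Parallel TF-IDF weighting and NB/LR classification on Spark (sketch).
from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF, IDF
from pyspark.ml.classification import NaiveBayes, LogisticRegression

spark = SparkSession.builder.appName("tnbif-pipeline").getOrCreate()
docs = spark.createDataFrame(
    [(0.0, ["云计算", "存储", "系统"]),
     (1.0, ["贝叶斯", "分类", "算法"]),
     (0.0, ["分布式", "存储", "平台"]),
     (1.0, ["文本", "分类", "模型"])],
    ["label", "words"],
)

# Term frequency via feature hashing, then inverse document frequency.
tf = HashingTF(inputCol="words", outputCol="tf").transform(docs)
tfidf = IDF(inputCol="tf", outputCol="features").fit(tf).transform(tf)

# Fit and apply both classifiers (on the same toy data, for brevity).
nb_pred = NaiveBayes(smoothing=1.0).fit(tfidf).transform(tfidf)
lr_pred = LogisticRegression(maxIter=100).fit(tfidf).transform(tfidf)
nb_pred.select("label", "prediction").show()
```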
Finally, the experiments evaluate the best influence coefficient, classification accuracy, recall, F-value, time performance, and speedup ratio on the Spark cluster (a small evaluation sketch follows). The experimental results show that the TNBIF model outperforms the other classification algorithms, and its advantage is even more evident on the Spark platform.
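As a minimal sketch of how the metrics named above could be computed with Spark ML's built-in evaluator, using a toy predictions DataFrame as a stand-in for real model output (the data and app name are illustrative assumptions):

```python
# Accuracy, weighted recall, and F1 on a toy predictions DataFrame (sketch).
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("tnbif-eval").getOrCreate()
pred = spark.createDataFrame(
    [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 0.0)],
    ["label", "prediction"],
)

for metric in ("accuracy", "weightedRecall", "f1"):
    score = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName=metric
    ).evaluate(pred)
    print(metric, round(score, 3))

# The speedup ratio on an n-node cluster is the usual S(n) = T(1) / T(n),
# the single-node runtime divided by the n-node runtime.
```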