Research On Parallel Text Classification Method Based On Support Vector Machine

Posted on: 2020-06-10    Degree: Master    Type: Thesis
Country: China    Candidate: Z F Feng    Full Text: PDF
GTID: 2428330575488559    Subject: Computer Science and Technology
Abstract/Summary:
Compared with other data types, text data consume fewer network resources and are easier to upload and download. As a result, most of the information on the network exists in text form. Because the Internet is closely linked to people's lives, its impact on them keeps growing. How to quickly analyze online views, predict online sentiment, and correctly guide public opinion within massive data has become an urgent problem. Text classification is one of the key technologies for addressing it, and how to classify text data accurately, quickly, and in real time has long been a central research topic in the field.

To address the text classification problem, this thesis proposes SPO-SVM, a method for streamlining the training data set, and presents a text classification method based on the support vector machine (SVM). The main contents are the preprocessing of text data, the SPO-SVM reduction of the training data set, and the classification of the text data set.

Text preprocessing mainly consists of word segmentation, feature word extraction, and text vectorization. The feature word vector is formed after word segmentation and feature word extraction, and it discriminates clearly among the text categories. After the text is vectorized, a quantized training sample file is output in the data format required for SVM training.

SPO-SVM streamlines the SVM training data set. The training data are divided into regions by hypersphere division, and the data within each region are grouped into one set. The algorithm uses whether the samples in a region share the same category as the condition for deciding whether to reduce that region's data. The text classification method comprises three steps: reducing the training data set, training the text classifier, and testing the classifier's accuracy. The effectiveness of the algorithm is verified by tests on multiple data sets.

Because parallel computing can effectively improve computational efficiency, a parallel version of the SVM-based text classification method is designed. A four-node virtual machine cluster is used to build a big data computing platform composed of Spark and other components, and a text data cleaning method is designed on top of the HDFS and Hive components. The SPO-SVM algorithm is then applied within the Spark parallel computing framework to further improve its computational efficiency and to verify its effectiveness.

In summary, the support vector machine is used to classify small-sample data to improve the accuracy of text classification, the streamlined training data set is used to improve training efficiency, and the parallel big data platform is used to improve the computational efficiency of the algorithm. The experimental data are 10 categories of documents from the Sogou corpus, with 8000 documents in each category. To further verify the feasibility of the approach, multiple standard data sets from the UCI website are used to test the feasibility and effectiveness of the SPO-SVM algorithm. The experimental results show that the training speed of the classifier model is significantly improved, while the prediction accuracy is consistent with that of the standard support vector machine. Tests of the parallel text classification algorithm show that running SPO-SVM in parallel greatly improves the speed of both classifier training and prediction on unknown text.
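The abstract describes the reduction step only in outline: training data are divided into regions by hypersphere division, and whether the samples in a region share one category decides whether that region is reduced. The Python sketch below illustrates that idea under assumptions that are not taken from the thesis: the function name `reduce_by_hyperspheres`, the fixed `radius`, the greedy choice of sphere centers, and the use of a centroid as the stand-in for a single-category region are all hypothetical.

```python
import math


def reduce_by_hyperspheres(points, labels, radius):
    """Illustrative reduction of a labelled training set (not the thesis's exact algorithm).

    Assumptions: spheres have a fixed radius, and each sphere is centred on the
    first sample that has not yet been assigned to a region.
    """
    unassigned = list(range(len(points)))
    reduced_points, reduced_labels = [], []
    while unassigned:
        center = points[unassigned[0]]
        # Region = every still-unassigned sample within `radius` of the center.
        inside = [i for i in unassigned if math.dist(points[i], center) <= radius]
        inside_set = set(inside)
        unassigned = [i for i in unassigned if i not in inside_set]
        if len({labels[i] for i in inside}) == 1:
            # All samples share one category: keep a single centroid for the region.
            dim = len(center)
            centroid = tuple(
                sum(points[i][d] for i in inside) / len(inside) for d in range(dim)
            )
            reduced_points.append(centroid)
            reduced_labels.append(labels[inside[0]])
        else:
            # Mixed categories: the region may lie near a class boundary, so keep every sample.
            for i in inside:
                reduced_points.append(tuple(points[i]))
                reduced_labels.append(labels[i])
    return reduced_points, reduced_labels


# Toy data: two single-category clusters and one mixed pair.
toy_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0), (5.1, 5.0)]
toy_labels = [0, 0, 0, 0, 1, 1, 1]
small_points, small_labels = reduce_by_hyperspheres(toy_points, toy_labels, radius=0.5)
print(len(toy_points), "->", len(small_points))
```

On the toy data, the two single-category clusters each collapse to one point while the mixed pair is kept intact, which matches the intuition that samples far from the class boundary contribute little to the SVM solution. The reduced set would then be passed to an ordinary SVM trainer.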
Keywords/Search Tags: text categorization, support vector machines, parallel computing, classifier, hypersphere division