| With the development of large-scale data processing technology, the demand for extracting useful information from massive data has grown steadily, and machine learning algorithms that perform well on small sample sets are gradually being introduced into big data processing scenarios. How to efficiently parallelize machine learning algorithms on large-scale data has therefore become a focus of research in recent years. Support Vector Machine (SVM) is a machine learning method based on the VC theory of statistical learning and the principle of structural risk minimization. It has many advantages over other machine learning algorithms for small sample sets, nonlinear data, and high-dimensional pattern recognition. However, when SVM is applied to large data sets, its high computational complexity and long running time make it difficult to use effectively. This thesis therefore studies the parallel optimization of the SVM algorithm for large-scale data environments, using Spark, a widely used parallel computing framework, as the implementation tool for parallel SVM. Based on the Spark platform, this thesis uses the indexedRDD developed at the University of California, Berkeley, to parallelize the P-pack SVM. To address the limitations of that model, the BPPGD algorithm is proposed. Experimental results show that BPPGD achieves higher classification accuracy and faster execution than the P-pack SVM algorithm on large-scale data. Cascade SVM is a multi-level model training method designed for distributed systems. The last stage of the algorithm can only run on a single machine, which limits the overall efficiency of model training and lengthens the algorithm's running time. 
This thesis implements the Cascade SVM algorithm on the Spark platform and proposes the CSP-SVM algorithm, which addresses the shortcomings of Cascade SVM by drawing on the advantages of the P-pack SVM algorithm. The resulting kernel SVM can make full use of a parallel distributed system, improving training speed while effectively preserving classification accuracy. Finally, based on the big data analysis platform BDAP developed by the Communication Software Engineering Center of Beijing University of Posts and Telecommunications, the integration of the parallel kernel SVM into the platform is described, and text data is used to test the performance of the two improved algorithms. |