| The support vector machine (SVM) is a classification algorithm in machine learning. It follows the principle of structural risk minimization, taking training error minimization as the constraint and confidence risk minimization as the optimization objective. It finds the support vectors by computing the kernel matrix and solving a quadratic programming problem, handles both linear and nonlinear problems, copes well with the curse of dimensionality, and has strong generalization ability and robustness. Although SVM classifies well, it involves steps with high time and space complexity, such as solving the quadratic program and computing and storing the kernel matrix. With the advent of the big data era, the growing scale of data makes this performance bottleneck increasingly obvious. Distributed frameworks for parallel processing of large-scale data, such as MapReduce, and their wide adoption in industry suggest parallel SVM algorithms as a way to break through the bottleneck in big data environments. Facing ever larger data volumes, three questions need to be solved urgently: how to solve SVM computing tasks in parallel on an effective distributed framework, how to reduce the overhead of parallel SVM algorithms, and how to guarantee their classification performance. On this basis, the main work of this paper is as follows. To address the large subset distribution bias, low parallel efficiency, and inaccurate filtering of non-support vectors in parallel SVM algorithms in big data environments, RC-PSVM, a parallel support vector machine algorithm based on relative entropy and cosine similarity, is proposed. First, the data division strategy 
DPRE, based on relative entropy, is proposed; it balances the relative entropy between the current subset and the original data set, reducing the subset distribution bias. Then, a redundancy-level detection strategy based on cosine similarity, CS-RLDS, is proposed; it calculates the cosine similarity between the normal vectors of local SVMs in adjacent layers, and identifies and stops redundant layers, improving parallel efficiency. Finally, a non-support-vector filtering strategy, NSVF, is proposed; it calculates support vector similarity to identify non-support vectors, which addresses the inaccurate filtering of non-support vectors. Experiments show that RC-PSVM achieves better classification performance and runs more efficiently in big data environments. To address data redundancy, load imbalance, and unidentifiable regions in parallel multi-class SVM algorithms in big data environments, K-PSVM, a parallel multi-class support vector machine algorithm based on K-means, is proposed. First, a projection-based redundant data deletion strategy, PRDR, is proposed; it obtains cluster centers by clustering each category in parallel, measures the projection of the center-point vectors between different clusters, and accurately identifies and deletes redundant data. Then, an estimation-based load-balanced parallel training strategy, PTLBP, is proposed; it estimates the support vector content of each training set and uses that estimate to distribute the training sets evenly across the parallel nodes. Finally, a multi-class model construction strategy based on class approximation is proposed: the samples to be identified are treated as an original cluster, the degree of approximation between the original cluster and its adjacent clusters is calculated, and a prediction model is built from that degree of approximation to predict the unidentifiable region accurately. Experiments show that the K-PSVM 
algorithm performs better on multi-class classification problems over large datasets. |
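
The two quantities that RC-PSVM relies on, relative entropy and cosine similarity, can be illustrated with a minimal Python sketch. The abstract does not say which distribution DPRE compares or what threshold CS-RLDS uses, so the sketch makes several assumptions: the distribution is taken to be the class-label distribution, the local SVMs are assumed linear so that each has an explicit normal vector, and the function names and the `threshold` value are hypothetical.

```python
import math
from collections import Counter


def class_distribution(labels, classes):
    """Empirical class-label distribution (an assumed stand-in for the
    distribution that DPRE compares)."""
    counts = Counter(labels)
    n = len(labels)
    return [counts[c] / n for c in classes]


def relative_entropy(p, q, eps=1e-12):
    """Relative entropy (KL divergence) D(p || q) in nats.
    eps guards against division by zero when q has an empty class."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def is_redundant_layer(w_prev, w_curr, threshold=0.999):
    """Hypothetical stopping test: treat a layer as redundant when the
    normal vectors of adjacent layers point in almost the same direction.
    The threshold value is an assumption, not taken from the paper."""
    return cosine_similarity(w_prev, w_curr) >= threshold


# A subset that mirrors the full data set has (near) zero relative entropy;
# a skewed subset has a clearly positive one.
classes = [0, 1]
full = class_distribution([0] * 80 + [1] * 20, classes)
balanced = class_distribution([0] * 8 + [1] * 2, classes)
skewed = class_distribution([0] * 2 + [1] * 8, classes)

kl_balanced = relative_entropy(balanced, full)
kl_skewed = relative_entropy(skewed, full)

# Parallel normal vectors indicate that the extra layer changed little.
redundant = is_redundant_layer([1.0, 2.0], [2.0, 4.0])
not_redundant = is_redundant_layer([1.0, 0.0], [0.0, 1.0])
```

Under these assumptions, a partitioner would prefer assignments that keep each subset's relative entropy to the full data set small, and a cascade-style trainer would stop merging once `is_redundant_layer` returns true.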