In big data computation for mechanical equipment, data skew remains one of the most difficult problems. The types and structures of the mechanical equipment data held by enterprises are now very complex, so traditional relational databases struggle to store semi-structured and unstructured data. Moreover, the traditional approach of building custom visualizations of mechanical equipment data is time-consuming and demands considerable expertise from operators. In addition, the K-means clustering algorithm depends heavily on the choice of initial centers, and the Gaussian mixture model (GMM) clustering algorithm depends on the distribution of the data samples, so it is difficult to cluster mechanical equipment scheduling and maintenance centers stably. To address these problems, this thesis completes the following research:

(1) This thesis proposes a Classification Balance Method (CFBM) for resolving data skew. First, SSEM (Sampling Statistics Extraction Method) is proposed on the basis of SSDM (Sampling Statistics Discrimination Method). Then, according to the causes of skew in Spark programs and the commonly used data operators, Spark data skew is divided into five categories, which are identified using these two methods. The CFBM algorithm is proposed to handle the five types of data skew, with a corresponding solution for each type. Finally, the CFBM algorithm is tested on a group of data sets, and the experimental results show that it achieves a better optimization effect on data sets with severe skew.

(2) This thesis proposes HDCA (Hybrid Decision Clustering Algorithm), based on the K-means algorithm and the GMM algorithm. First, starting from the two result sets produced by K-means clustering and GMM clustering, the algorithm determines the distribution of HDCA cluster categories by calculating the matching degree between the two result sets. Then, according to whether the two traditional clustering algorithms agree, the data samples are divided into deterministic data and disputed data. The center point of each category is determined from the deterministic data, and the disputed data are classified by a distance-based decision. Finally, a comparative experiment on the accuracy of HDCA is carried out. The experimental results show that the accuracy of HDCA is much higher than that of the two traditional clustering algorithms and slightly higher than that of other algorithms studied by researchers.

(3) A visual web component library is designed and implemented, a customizable big data platform is built with the component library as its core, and the big data monitoring business for construction machinery and equipment is implemented on that platform. This thesis extracts and classifies all the elements of web pages into six categories: regular charts, maps, auxiliary pictures, text, media, and custom components. Each category is further divided into several subcategories according to the attributes of its elements. The component library adopts an MVVM architecture similar to that of the Vue framework and, through monitoring, keeps data synchronized in real time among the component display, the component state tree, and the component parameter configuration module. Hadoop provides data storage, the Spark distributed computing engine provides the computing foundation, and the Vue front-end framework provides the technical support for the customizable big data platform, through which the data display for the big data monitoring business of construction machinery and equipment is realized.
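
The hybrid decision step described in (2) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function name `hybrid_decision` is hypothetical, the "matching degree" is assumed here to be the number of samples two clusters share, each cluster of the first result set is assumed to be matched greedily to the cluster of the second result set with the largest overlap, and the distance-based decision is assumed to be Euclidean distance to the nearest center. The two label lists are taken as given, for example from a K-means run and a GMM run on the same samples.

```python
from collections import Counter
from math import dist


def hybrid_decision(points, labels_a, labels_b):
    """Combine two clusterings of the same points (illustrative sketch only).

    points   : list of equal-length numeric tuples
    labels_a : cluster label per point from the first algorithm (e.g. K-means)
    labels_b : cluster label per point from the second algorithm (e.g. GMM)
    Returns (final_labels, centers), using the label names of labels_a.
    """
    # Assumed matching degree: number of samples shared by a pair of clusters.
    overlap = Counter(zip(labels_a, labels_b))
    clusters_b = sorted(set(labels_b))

    # Greedy matching (an assumption): pair each A-cluster with the B-cluster
    # it overlaps most.
    match = {
        a: max(clusters_b, key=lambda b: overlap[(a, b)])
        for a in sorted(set(labels_a))
    }

    # Deterministic data: samples on which both algorithms agree under the matching.
    final = [None] * len(points)
    agreed = {}
    for i, (a, b) in enumerate(zip(labels_a, labels_b)):
        if match[a] == b:
            final[i] = a
            agreed.setdefault(a, []).append(points[i])

    if not agreed:
        raise ValueError("the two clusterings share no agreed samples")

    # Category centers are the means of the deterministic samples only.
    centers = {
        a: tuple(sum(coord) / len(pts) for coord in zip(*pts))
        for a, pts in agreed.items()
    }

    # Disputed data: assign each remaining sample to the nearest center.
    for i, p in enumerate(points):
        if final[i] is None:
            final[i] = min(centers, key=lambda a: dist(p, centers[a]))

    return final, centers
```

Calling `hybrid_decision(points, kmeans_labels, gmm_labels)` returns one label per sample together with the centers computed from the agreed samples. Computing centers only from samples on which both algorithms agree keeps disputed samples from pulling the centers toward either algorithm's errors, which is the motivation the abstract gives for separating deterministic from disputed data.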