| In recent years, driven by carbon peaking and carbon neutrality goals and strong national policy support, renewable power generation industries, represented by wind power, have developed vigorously, with continuous technological innovation. Analyzing and processing wind turbine operational data with machine learning, big data analysis, and related technologies can extract useful information and provide reliable data support for subsequent work such as wind turbine power prediction, condition monitoring, and fault diagnosis, thereby improving the operational reliability and economy of wind turbines. However, actual wind turbine operation is complex. Because of time-varying operating conditions, the high failure rate of turbine components caused by the harsh operating environment, and manual power curtailment, actual wind turbine SCADA data contain a certain proportion of abnormal data, and the abnormalities are complex in nature. In this paper, the historical operational data set of wind turbine E17 from a wind farm in Inner Mongolia is taken as an example, and a clustering algorithm based on a mixture model is used to cluster the data points effectively and identify the abnormal data accurately, thus achieving the preprocessing of wind turbine operational data. The main research contents are as follows: (1) Wind speed and power data collected by the wind turbine SCADA system are crucial for turbine operation and maintenance. Taking the data of test turbine E17 as an example, this paper analyzes in depth the distribution characteristics of wind speed-power data points in the coordinate system and summarizes the main types of abnormal data and the reasons for their generation. Clustering algorithms from machine learning are briefly outlined, and the principles, advantages, and disadvantages of common clustering algorithms are described. (2) To address the complexity of wind turbine 
data distribution characteristics, a data preprocessing method using a Dirichlet Process Gaussian Mixture Model (DPGMM) based on variational Bayesian inference is proposed. The mixture model can adaptively determine the optimal number of Gaussian components according to the data distribution, overcoming the limitation of the traditional Gaussian Mixture Model that the number of components must be set manually. The operational data points of test turbine E17 are allocated to power bins created at a fixed interval along the horizontal power direction, and the DPGMM is used to cluster the data points in each horizontal power bin. Combining the confidence ellipse parameters of the Gaussian components with prior experience of the data distribution characteristics, the abnormal Gaussian components of E17 and the abnormal data clustered in them are identified and labeled accurately. Using actual SCADA data from the test turbine E17 and another turbine, LY16, the proposed method is demonstrated to be effective and general. (3) The concentration parameter is an important hyperparameter of the DPGMM and can affect the clustering results. This paper explores the mathematical relationship between the concentration parameter value and the number of Gaussian components, and uses Kernel Density Estimation to convert the data distribution characteristics into numerical probability density curves. This provides guidance for choosing the concentration parameter value based on the scattering characteristics of the wind turbine data and the expected clustering effects, so that the DPGMM is better suited to big data analysis in the wind power field. |
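
As a rough illustration of point (3), the sketch below evaluates the standard prior expectation of the number of occupied components under a Dirichlet process, E[K] = Σ_{i=0}^{n-1} α / (α + i), where α is the concentration parameter and n is the number of data points. This is the textbook Chinese-restaurant-process result and is not necessarily the relationship derived in this paper; the function name `expected_components` and the bin size of 2000 points are hypothetical choices made only for the example.

```python
def expected_components(alpha: float, n_points: int) -> float:
    """Prior expected number of occupied components under a Dirichlet process.

    Standard Chinese-restaurant-process result (an assumption here, not
    taken from the paper): E[K] = sum_{i=0}^{n-1} alpha / (alpha + i).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if n_points < 1:
        raise ValueError("n_points must be at least 1")
    return sum(alpha / (alpha + i) for i in range(n_points))


# Hypothetical number of SCADA points falling in one horizontal power bin.
N_POINTS_IN_BIN = 2000

# A larger concentration parameter favours more Gaussian components a priori.
for alpha in (0.01, 0.1, 1.0, 10.0):
    k = expected_components(alpha, N_POINTS_IN_BIN)
    print(f"alpha={alpha:>5}: prior expected components = {k:.2f}")
```

The expectation grows roughly like α·ln(1 + n/α), so the number of points in a bin matters as well as α. In scikit-learn, the corresponding hyperparameter of `BayesianGaussianMixture` is `weight_concentration_prior`; whether the paper used that implementation is not stated here.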