In a power dispatch automation system, the storage system preserves the programs and data on which the system depends, and the disk is its most widely used hardware device. A disk failure reduces overall service availability and may cause permanent loss of stored data, compromising the user experience. The disk's built-in early-warning mechanism currently detects faults with low accuracy and does not meet practical requirements. Because disks are a relatively stable storage medium, failure data are scarce; disks also come in a wide variety of models, and their data are high-dimensional with complex distributions. Existing disk failure prediction methods have the following problems. First, the SMART attribute data of disks in the power dispatch automation system are both high-dimensional and imbalanced, and existing sampling methods for handling class imbalance suffer badly from the curse of dimensionality, so they struggle to achieve good results on disk data. Second, the number of disks in the power dispatch automation system is limited, so it is difficult to collect enough failure samples for supervised training. Third, the SMART attribute data contain many attributes, some of which are only weakly correlated; existing methods have difficulty identifying irrelevant attributes under unsupervised conditions, which degrades the predictive ability of the disk failure detection model. Fourth, when the disks of the power dispatch automation system are replaced with new ones, too little data is available for model training. Most disk transfer learning algorithms use only single-source-domain data, and the existing multi-source-domain transfer learning algorithm only considers expanding the target-domain data and does not optimize the transfer model itself, so it cannot
effectively improve detection performance in the target domain.

To address these problems, this paper studies disk failure prediction methods based on machine learning. The results are of great significance for accurately predicting disk failures and ensuring the stable operation of the power dispatch automation system. The main work of the paper is as follows:

(1) An oversampling method for high-dimensional disk datasets with imbalanced normal and abnormal samples is studied. Based on the characteristics of disk SMART attribute data, and after preprocessing, a high-dimensional hypersphere oversampling method is proposed to handle both the imbalance between normal and abnormal samples and the high dimensionality of disk datasets. The method first randomly samples the minority-class set to obtain as many samples as are needed to balance the classes. For each of these samples in turn, it then finds the nearest neighbor in the minority-class distribution space by Euclidean distance, connects the two points with a line segment, and takes the midpoint of that segment as the center of a sampling hypersphere in the high-dimensional space. Within this hypersphere, the required new minority-class points are generated randomly through an iteration over distances in the dimensional space, which rebalances the classes while broadening the spatial distribution of the minority-class samples. The effectiveness and advantages of the proposed method are verified by comparative experiments against typical existing oversampling methods on public datasets and disk datasets.

(2) Unsupervised disk anomaly detection based on feature prediction is studied. Because labels for disk samples in the power dispatch automation system are difficult to obtain, a disk failure-degree ranking method based on feature-correlation partition regression is proposed from the perspective of unsupervised anomaly detection; it can alleviate the adverse
effects of high dimensionality and irrelevant attributes on model performance to a certain extent. The high-dimensional dataset is divided into multiple feature subspaces according to the correlations between features. In each subspace, the feature with the highest correlation coefficient is used as a pseudo-label, and the remaining features serve as predictors for training a supervised regression model. The anomaly score of each sample in the subspace is then computed from the difference between the regression prediction and the true value of the pseudo-label. In the subspace integration stage, a weighting strategy based on the level of correlation is defined to obtain the final anomaly score ranking. The effectiveness and advantages of the proposed method are verified by comparative experiments against typical existing unsupervised anomaly detection algorithms on public datasets and disk datasets.

(3) An optimization method for the multi-source-domain disk transfer model is studied. Because the SMART attribute data of different disk models from the same manufacturer have similar distributions, a multi-source-domain transfer model optimization method based on double-layer dynamic weighting is proposed to address the lack of sufficient data for building a reliable model when new disks are first put into use. Following the idea of sample transfer, the algorithm assigns initial weights to all samples in each source domain and in the target domain and uses them for model training. It then updates the weights according to the error rate of the trained model on the target-domain samples, and through repeated iterations obtains a model biased toward the target-domain samples. Meanwhile, the error rate of each iteration is retained and used for the ensemble weight of each source-domain model. In
addition, the updated two-layer weights can also be used for further learning of subsequent models. The method also includes an online update function: after a threshold is set, new samples collected during disk operation are used as a new target domain to continue training the model. In this way, each source-domain model and its corresponding model weight are updated dynamically, further improving failure prediction accuracy for target-domain disks. The effectiveness and advantages of the proposed method are verified by comparing results before and after transfer and before and after the online update, and by comparing the proposed method with existing multi-source-domain transfer learning methods.
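
As an illustration of the hypersphere oversampling idea summarized in (1), the following is a minimal Python sketch and not the paper's implementation. Several details are assumptions, because the abstract does not fix them: the hypersphere radius is taken to be half the distance between a sample and its nearest neighbor, new points are drawn uniformly inside the ball in place of the unspecified distance iteration, and all function names are hypothetical.

```python
import math
import random


def nearest_neighbor(point, others):
    """Return the element of `others` closest to `point` (Euclidean distance)."""
    return min(others, key=lambda q: math.dist(point, q))


def sample_in_ball(center, radius, rng):
    """Draw one point uniformly from the ball with the given center and radius.

    Assumption: uniform sampling stands in for the paper's distance iteration.
    """
    d = len(center)
    # A Gaussian vector gives a uniformly distributed direction.
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in direction)) or 1.0
    # u ** (1 / d) makes the point uniform in volume rather than in radius.
    scale = radius * rng.random() ** (1.0 / d) / norm
    return [c + scale * x for c, x in zip(center, direction)]


def hypersphere_oversample(minority, n_new, seed=0):
    """Generate `n_new` synthetic minority-class samples.

    `minority` is a list of equal-length lists of floats (hypothetical layout).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Randomly pick a minority sample (with replacement).
        i = rng.randrange(len(minority))
        p = minority[i]
        # Nearest minority-class neighbor, excluding the sample itself.
        q = nearest_neighbor(p, minority[:i] + minority[i + 1:])
        # Midpoint of the segment p-q is the center of the hypersphere.
        center = [(a + b) / 2.0 for a, b in zip(p, q)]
        # Assumption: the segment p-q is a diameter of the hypersphere.
        radius = math.dist(p, q) / 2.0
        synthetic.append(sample_in_ball(center, radius, rng))
    return synthetic
```

For example, `hypersphere_oversample(failed_disk_rows, n_new=len(healthy_rows) - len(failed_disk_rows))` would produce enough synthetic failure samples to balance the two classes, where both variable names are placeholders for preprocessed SMART feature vectors. Because every generated point stays inside a ball spanned by two real minority samples, the synthetic data cannot drift far from the observed minority region.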