
Research On Disk Failure Prediction In Data Centers

Posted on: 2021-02-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T M Jiang
Full Text: PDF
GTID: 1488306107455744
Subject: Computer Science and Technology
Abstract/Summary:
With the advent of the Internet era, the rapid growth of data has brought huge challenges to storage. Because of their large capacity and low price, disks are widely used in data center storage. However, a disk is a complex mechanical and electronic device, so maintaining its high reliability is very challenging. Disk failure prediction identifies impending disk failures so that the data on those disks can be migrated proactively before they fail, which improves system reliability and reduces maintenance cost. However, some problems remain to be solved: (1) the lack of failed-disk samples limits the applicability of disk failure prediction methods based on supervised classification models; (2) only prediction accuracy is used to measure the quality of a prediction method, and the cost of mis-predictions is not evaluated; (3) in scrubbing based on sector error prediction, increasing the scrubbing frequency for disks with latent sector errors leads to higher maintenance cost. To address these three problems, the main work covers the following three aspects.

Firstly, to address the limited applicability of disk failure prediction methods based on supervised classification models, SPA, a disk failure prediction method based on an anomaly detection model, is proposed. SPA treats failed-disk samples as anomalies and uses only healthy-disk samples to train the model, thus solving the model cold-start problem. In addition, by constructing a two-dimensional, image-like representation of SMART data and combining it with a deep neural network, features of the SMART data can be mined automatically. Model updating is realized by fine-tuning the deep neural network, thus solving the model aging problem. Experimental results on the real-world Backblaze data set show that SPA can achieve a 1% false positive rate and a 99% failure detection rate over the whole life cycle of disks. These results demonstrate that the anomaly-detection-based SPA can overcome the applicability limit of existing failure prediction methods.

Secondly, to address the lack of a metric for evaluating mis-prediction cost, VCM, a mis-prediction cost optimization method for disk failure prediction, is proposed. From the perspective of reducing the cost of reliability maintenance, VCM introduces the cost of mis-predictions into disk failure prediction and reduces that cost through cost-sensitive learning. Specifically, VCM assigns different cost weights to false negatives and false positives, and constructs a loss function for cost-sensitive learning. A threshold-moving strategy is then used to select the prediction threshold with the lowest cost. Experimental results on the real-world Backblaze and Baidu data sets show that, compared with cost-blind methods, VCM can reduce the mis-prediction cost by up to 22%. These results demonstrate that cost-sensitive learning is effective in reducing the mis-prediction cost.

Finally, to address the problem that scrubbing methods based on sector error prediction increase scrubbing cost, FAS, an adaptive scrubbing method, is proposed. Based on the results of sector error prediction, FAS increases the scrubbing frequency for disks predicted to have sector errors and reduces the scrubbing frequency for healthy ones. In addition, because scrubbing is periodic, a voting-based mapping method is introduced to map sample-level prediction results to disk-level prediction results. Experimental results on the real-world Backblaze data set show that, compared with the state-of-the-art scrubbing method, FAS can achieve the same data reliability while reducing the scrubbing cost by up to 32%. These results demonstrate that sector error prediction is effective in reducing the scrubbing cost and improving data reliability.
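
The abstract does not specify how the threshold-moving step of VCM or the voting-based mapping of FAS is implemented. The following is only a minimal Python sketch of the two general ideas, under assumed conventions: each sample has a failure score in [0, 1], label 1 means a failed disk, and a disk is flagged when more than a chosen fraction of its samples are predicted positive. All function names, parameters, and cost values are hypothetical and are not taken from the dissertation.

```python
from collections import defaultdict


def misprediction_cost(scores, labels, threshold, fn_cost, fp_cost):
    """Total cost of mis-predictions at one threshold (label 1 = failed disk)."""
    cost = 0.0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 0:
            cost += fp_cost  # healthy disk flagged as failing
        elif predicted == 0 and label == 1:
            cost += fn_cost  # failing disk missed
    return cost


def pick_threshold(scores, labels, fn_cost, fp_cost):
    """Threshold moving: try each observed score (plus 'flag nothing')
    and keep the threshold with the lowest total mis-prediction cost."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        candidates,
        key=lambda t: misprediction_cost(scores, labels, t, fn_cost, fp_cost),
    )


def vote_per_disk(sample_predictions, min_fraction=0.5):
    """Map sample-level predictions to disk-level predictions by voting.

    sample_predictions: iterable of (disk_id, 0 or 1) pairs.
    A disk is flagged when more than min_fraction of its samples are 1.
    """
    positives = defaultdict(int)
    totals = defaultdict(int)
    for disk_id, predicted in sample_predictions:
        totals[disk_id] += 1
        positives[disk_id] += predicted
    return {d: positives[d] / totals[d] > min_fraction for d in totals}


# Hypothetical validation data, for illustration only.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]

# When a missed failure is ten times as costly as a false alarm,
# the chosen threshold moves down so that fewer failures are missed.
low_threshold = pick_threshold(scores, labels, fn_cost=10, fp_cost=1)

# When a false alarm is the costlier error, the threshold moves up.
high_threshold = pick_threshold(scores, labels, fn_cost=1, fp_cost=10)

disk_flags = vote_per_disk(
    [("d1", 1), ("d1", 1), ("d1", 0), ("d2", 0), ("d2", 1), ("d2", 0)]
)
```

In this sketch the cost weights act only at threshold selection; a cost-sensitive loss function, as described for VCM, would additionally apply such weights during training.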
Keywords/Search Tags:data centers, hard disk, reliability, failure prediction, machine learning