Research On Disk Failure Prediction In Data Centers

Posted on:2021-02-07

Degree:Doctor

Type:Dissertation

Country:China

Candidate:T M Jiang

GTID:1488306107455744

Subject:Computer Science and Technology

Abstract/Summary:

With the advent of the Internet era, the rapid growth of data volume has brought huge challenges to storage systems. Thanks to their large capacity and low price, hard disks are widely used for data center storage. However, a disk is a complex mechanical and electronic device, so maintaining its high reliability is very challenging. Disk failure prediction forecasts impending disk failures so that the data on these disks can be proactively migrated before they fail, thereby improving system reliability and reducing maintenance cost. However, several problems remain to be solved: (1) the lack of failed-disk samples limits the applicability of disk failure prediction methods based on supervised classification models; (2) the quality of a prediction method is measured by prediction accuracy alone, with no evaluation of the cost of mis-predictions; (3) scrubbing methods based on sector error prediction increase the scrubbing frequency for disks with latent sector errors, which leads to higher maintenance cost. To address these three problems, this dissertation makes the following three contributions.

Firstly, to overcome the applicability limit of disk failure prediction methods based on supervised classification models, SPA, a disk failure prediction method based on an anomaly detection model, is proposed. SPA treats failed-disk samples as anomalies and trains the model using only healthy-disk samples, thus solving the model cold-start problem. In addition, by constructing a two-dimensional, image-like representation of SMART data and combining it with a deep neural network, SPA mines SMART data features automatically. At the same time, model updating is realized through fine-tuning of the deep neural network, thus solving the model aging problem. Experimental results on the real-world Backblaze data set show that SPA achieves a 1% false positive rate and a 99% failure detection rate over the whole life cycle of disks, demonstrating that the anomaly-detection-based SPA overcomes the applicability limit of existing failure prediction methods.

Secondly, to address the lack of a metric for evaluating mis-prediction cost, VCM, a mis-prediction cost optimization method for disk failure prediction, is proposed. From the perspective of reducing reliability maintenance cost, VCM introduces the cost of mis-predictions into disk failure prediction and reduces that cost through cost-sensitive learning. Specifically, VCM assigns different cost weights to false positives and false negatives and constructs a loss function for cost-sensitive learning. A threshold-moving strategy is then used to select the prediction threshold with the lowest cost. Experimental results on the real-world Backblaze and Baidu data sets show that, compared with cost-blind methods, VCM reduces the mis-prediction cost by up to 22%, demonstrating that cost-sensitive learning is effective in reducing mis-prediction cost.

Finally, to address the increased scrubbing cost of scrubbing methods based on sector error prediction, FAS, an adaptive scrubbing method, is proposed. Based on the results of sector error prediction, FAS increases the scrubbing frequency for disks with sector errors and reduces it for healthy disks. In addition, considering the periodic nature of scrubbing, a voting-based mapping method is introduced to map sample-level prediction results to disk-level prediction results. Experimental results on the real-world Backblaze data set show that, compared with the state-of-the-art scrubbing method, FAS achieves the same data reliability while reducing the scrubbing cost by up to 32%, demonstrating that sector error prediction is effective in reducing scrubbing cost and improving data reliability.
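
SPA's central idea is to fit a detector on healthy disks only and flag anything that deviates from them. The following minimal sketch illustrates it under stated assumptions. Each sample is assumed to be a window of daily SMART readings stacked into a days × attributes matrix, which stands in for the two-dimensional, image-like representation. SPA's deep neural network is deliberately swapped for a simple per-cell z-score detector. The names `smart_image` and `HealthyOnlyDetector` and the quantile threshold rule are hypothetical illustrations, not the dissertation's implementation, and model updating by fine-tuning is not shown.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]  # rows = days, columns = SMART attributes


def smart_image(daily_smart: Sequence[Sequence[float]], window: int) -> Matrix:
    """Hypothetical helper: stack the last `window` days of SMART readings
    into a 2-D, image-like matrix (days x attributes)."""
    if len(daily_smart) < window:
        raise ValueError("not enough days of SMART history")
    return [list(row) for row in daily_smart[-window:]]


class HealthyOnlyDetector:
    """Hypothetical stand-in for SPA's deep model: a per-cell z-score detector
    fitted only on windows from healthy disks, so no failed samples are needed."""

    def fit(self, healthy: Sequence[Matrix], target_fpr: float = 0.01) -> "HealthyOnlyDetector":
        n = len(healthy)
        rows, cols = len(healthy[0]), len(healthy[0][0])
        self.mean = [[sum(m[r][c] for m in healthy) / n for c in range(cols)]
                     for r in range(rows)]
        # Constant cells get std 1.0 to avoid division by zero.
        self.std = [[math.sqrt(sum((m[r][c] - self.mean[r][c]) ** 2 for m in healthy) / n) or 1.0
                     for c in range(cols)] for r in range(rows)]
        # Threshold = (1 - target_fpr) quantile of the healthy scores, so at most
        # roughly `target_fpr` of healthy training windows score above it.
        scores = sorted(self.score(m) for m in healthy)
        k = max(0, min(n - 1, math.ceil((1 - target_fpr) * n) - 1))
        self.threshold = scores[k]
        return self

    def score(self, m: Matrix) -> float:
        # Anomaly score = largest absolute z-score over all (day, attribute) cells.
        return max(abs(m[r][c] - self.mean[r][c]) / self.std[r][c]
                   for r in range(len(m)) for c in range(len(m[0])))

    def is_failing(self, m: Matrix) -> bool:
        return self.score(m) > self.threshold
```

The sketch shows only why training needs no failed-disk samples. In SPA itself the scoring model is a deep network over the SMART image that is periodically fine-tuned to counter model aging.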
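
VCM's two steps, a cost-weighted training loss followed by threshold moving, can be sketched as shown here. This is a minimal illustration under assumptions: the weighted binary cross-entropy form, the cost parameters `c_fp` and `c_fn`, and the function names are hypothetical, and VCM's actual loss function and prediction model are not reproduced.

```python
import math
from typing import Sequence, Tuple


def weighted_bce(probs: Sequence[float], labels: Sequence[int],
                 c_fp: float, c_fn: float) -> float:
    """Cost-sensitive binary cross-entropy (assumed form): a miss on a failing
    disk (label 1) is weighted by c_fn, an alarm on a healthy disk (label 0) by c_fp."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total -= c_fn * y * math.log(p) + c_fp * (1 - y) * math.log(1 - p)
    return total / len(probs)


def misprediction_cost(probs: Sequence[float], labels: Sequence[int],
                       threshold: float, c_fp: float, c_fn: float) -> float:
    """Total cost of false positives and false negatives at a given threshold."""
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    return c_fp * fp + c_fn * fn


def move_threshold(probs: Sequence[float], labels: Sequence[int],
                   c_fp: float, c_fn: float) -> Tuple[float, float]:
    """Threshold moving: evaluate every distinct score (plus 'never alarm')
    as a threshold and keep the one with the lowest mis-prediction cost."""
    candidates = sorted(set(probs)) + [math.inf]
    best = min(candidates, key=lambda t: misprediction_cost(probs, labels, t, c_fp, c_fn))
    return best, misprediction_cost(probs, labels, best, c_fp, c_fn)
```

Consider scores `[0.1, 0.3, 0.4, 0.8, 0.9]` with labels `[0, 0, 1, 0, 1]`. Making misses expensive (`c_fp=1, c_fn=10`) moves the threshold down to 0.4. Making false alarms expensive (`c_fp=10, c_fn=1`) moves it up to 0.9. This is the kind of shift a cost-blind fixed threshold cannot make.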
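
FAS combines two ingredients: voting-based mapping from sample-level to disk-level predictions, and a per-disk scrubbing frequency. They might look roughly like the sketch below. The strict-majority vote, the 336-hour (two-week) baseline interval, the speed-up and slow-down factors, and the function names are hypothetical choices for illustration, not parameters taken from the dissertation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def disk_level_votes(sample_preds: Iterable[Tuple[str, bool]],
                     min_fraction: float = 0.5) -> Dict[str, bool]:
    """Map sample-level predictions (disk_id, predicted_sector_error) gathered during
    one scrubbing period to a disk-level verdict: a disk is flagged when the fraction
    of positive votes exceeds `min_fraction` (hypothetical strict-majority rule)."""
    votes = defaultdict(lambda: [0, 0])  # disk_id -> [positive votes, total votes]
    for disk_id, positive in sample_preds:
        votes[disk_id][0] += int(positive)
        votes[disk_id][1] += 1
    return {d: pos / total > min_fraction for d, (pos, total) in votes.items()}


def scrub_interval_hours(suspect: bool, base_hours: float = 336.0,
                         speedup: float = 4.0, slowdown: float = 2.0) -> float:
    """Hypothetical adaptive policy: scrub suspected disks more often and
    healthy disks less often than a fixed baseline interval."""
    return base_hours / speedup if suspect else base_hours * slowdown


# Example: two disks, several daily predictions each within one period.
preds = [("disk-a", True), ("disk-a", True), ("disk-a", False),
         ("disk-b", False), ("disk-b", True)]
verdicts = disk_level_votes(preds)
schedule = {d: scrub_interval_hours(s) for d, s in verdicts.items()}
```

In this example, `disk-a` (2 of 3 positive votes) is flagged and scrubbed every 84 hours. `disk-b` (a 1-1 tie) is not flagged and is scrubbed every 672 hours. Spending scrubbing effort only where errors are predicted is what lets an adaptive policy cut total scrubbing cost.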

Keywords/Search Tags:

data centers, hard disk, reliability, failure prediction, machine learning

Related items

1	Disk Failure Prediction In Data Centers Via Online Learning
2	Analysis And Research Of Disk Failures In Data Centers
3	Research On Hard Disk Failure Prediction Method Based On Improved Random Forest Algorithm
4	Research On Hard Disk Fault Prediction Technology In Massive Data Storage System
5	Research On Hard Disk Failure Prediction Technology Based On Deep Learning
6	Design And Implementation Of Disk Failure Prediction System Based On Machine Learning
7	Predicting Disk Failures For Large-scale Datacenter By Machine-learning Method
8	Research On Method And Application For Failure Prediction Of Heterogeneous Disks In Large Data Center
9	Research On Disk Failure Prediction Method Based On Multi-dimensional Features
10	Research On Method For Hard Drive Failure Prediction In Massive Storage System