
Research On Fault Root Cause Location And Prediction Mechanism In Large Scale Network Systems

Posted on: 2024-03-04  Degree: Master  Type: Thesis
Country: China  Candidate: S C Xu  Full Text: PDF
GTID: 2568307136495594  Subject: Software engineering
Abstract/Summary:
In large-scale network systems, manual management and maintenance can no longer keep up as networks continue to grow. Automated operations and maintenance uses tools and algorithms to handle deployment, configuration, monitoring, security checks, fault diagnosis, and other operations automatically. This reduces errors caused by human operation, helps ensure system security and stability, and greatly improves efficiency and reliability. Meanwhile, with the continuous development of artificial intelligence and machine learning, intelligent operations and maintenance (AIOps) has emerged. AIOps automatically collects and analyzes large amounts of IT data and applies various algorithms for rapid analysis, learning, and prediction, greatly improving the level and efficiency of automated operations and maintenance.

Within AIOps, two tasks have received particular research attention: locating the root cause of network node failures and predicting disk failures. In a complex cloud network environment, thousands of network nodes generate a large amount of operational information every day. When a node fails, it usually causes abnormalities in the nodes connected to it, producing a flood of alarms that obscures the real underlying cause and can even lead to network paralysis. To keep services running stably, these alarms must be analyzed and processed. At the same time, storage, log backup, and data in a large-scale cloud network environment all depend on large-capacity hard drives. Disks are currently the main general-purpose storage devices in large-scale data center storage systems, and the disk failures that occur in the field make it challenging to keep data center management highly available and reliable. This thesis therefore studies network node fault root cause location and hardware-level disk failure prediction, covering the following three aspects:

(1) Existing methods for locating the root cause of node faults have several deficiencies: domain knowledge is difficult to maintain and update, fault dependency graphs have low accuracy, fault analysis and location are not timely, and data imbalance weakens the generalization ability of models. This thesis proposes a data-driven Generative Adversarial Network (GAN) model named TLS-WGAN-GP for fault root cause location. The model uses a GAN to fit and learn the distribution of minority-class data, which effectively reduces the chance that generated minority-class samples overlap with other samples and fits the distribution of high-dimensional data well. As an innovation, a three-layer subnet is used in the generator to capture the original features of the root cause data and its distribution in latent space, so that higher-quality data is generated. In addition, a gradient penalty is introduced into the discriminator's loss function to overcome the difficulty and instability of training on structured data, improving the classification performance and generalization ability of the model. Experimental results show that the proposed TLS-WGAN-GP model improves the F1 score from 95% to 98%, which indicates that it can effectively locate the root causes of network node faults in a data-driven way and advances root cause location technology in intelligent operations and maintenance.

(2) Existing disk failure prediction methods also have deficiencies: they neglect time series features of non-fixed length, their prediction accuracy is low because the actual time of disk failure is uncertain in degraded SMART data, and data imbalance weakens the generalization ability of models. This thesis therefore proposes a convolutional Transformer model, ConvTrans-TPS, for disk failure prediction in large-scale cloud data centers. The model uses a multi-head self-attention mechanism and a multi-layer encoder-decoder structure to learn the dependencies within time series data and to establish temporal correlations between data at different time steps. It also replaces the existing position-wise linear projection in the attention calculation with a convolutional projection, embedding the queries, keys, and values through convolution to strengthen the focus on local contextual information. To deal with the imbalance of the time series data, the time progressive sampling (TPS) method is used for data augmentation. Experimental results show that the proposed ConvTrans-TPS model achieves an F1 score of 0.96 and a Matthews correlation coefficient (MCC) of 0.92 on the Backblaze dataset. It outperforms the popular CNN-LSTM models by 4% in F1 and 5% in MCC, and it also performs better than the other comparison models.

(3) Based on the above research, this thesis designs and implements an intelligent operations and maintenance platform. The platform takes TLS-WGAN-GP and ConvTrans-TPS as its core algorithms and mainly provides node fault root cause location and disk failure prediction. Using Web technology together with these algorithms, the platform visualizes the results, locates the root causes of node faults, and predicts potential future disk failures, which verifies the effectiveness and accuracy of TLS-WGAN-GP and ConvTrans-TPS. The platform gives operations and maintenance personnel more intelligent, faster, and more accurate services.
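For readers unfamiliar with the two metrics reported for disk failure prediction, the following is a minimal sketch of how an F1 score and a Matthews correlation coefficient (MCC) can be computed from a binary confusion matrix. The function name `f1_and_mcc` and the counts used are hypothetical illustrations and are not taken from the thesis or from the Backblaze dataset.

```python
import math


def f1_and_mcc(tp, fp, fn, tn):
    """Return (F1, MCC) for a binary confusion matrix.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives, and true negatives. A zero denominator is
    mapped to 0.0, which is a common convention, not the only one.
    """
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return f1, mcc


# Hypothetical, imbalanced example: 100 failed disks among 1000.
f1, mcc = f1_and_mcc(tp=90, fp=10, fn=10, tn=890)
print(f"F1={f1:.3f}  MCC={mcc:.3f}")

# A model that never predicts a failure scores 0 on both metrics,
# even though its plain accuracy would be 99% on this split.
print(f1_and_mcc(tp=0, fp=0, fn=10, tn=990))
```

Unlike plain accuracy, both metrics stay low for a model that ignores the minority (failed-disk) class, which is one reason they are commonly reported on imbalanced data.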
Keywords/Search Tags:fault root cause location, disk fault prediction, data imbalance, generative adversarial network, intelligent operation and maintenance