| With the continuous development of the Internet, demand for network services keeps growing, as do the number of connected devices and the number of services running on them. These services span industries such as banking, industrial manufacturing, and autonomous driving. Operation and maintenance is an infrastructure-level technology of the digital world. As enterprise software and hardware systems grow larger, more complex, and more diverse, traditional IT operation and maintenance becomes very difficult, and the demand for intelligent operation and maintenance continues to increase. AIOps (Artificial Intelligence for IT Operations), which began to appear in 2016, combines operation and maintenance with AI to make fault discovery, fault location, fault self-healing, and capacity prediction more reliable and stable. Taking an operator's recharge (top-up) business as its scenario, this paper studies two key AIOps technologies, fault discovery and fault location, and designs and implements minute-level root cause location. (1) The core of fault discovery is indicator anomaly detection. This paper performs anomaly detection on the golden business indicators (business volume, success rate, and delay): the indicators are collected through Kafka, and missing values in the indicator data are filled by linear interpolation. Features are extracted from the inherent properties of the time series data (such as smoothness, variance, and maximum and minimum values), and the extracted features are normalized with the Min-Max (minimum-maximum normalization) method. Isolation Forest and Variational Autoencoder are selected as the training models for indicator anomaly detection, and different parameters and network structures are tried in order to select the best configuration. The paper then analyzes the advantages of both models and, building on the high accuracy of Isolation Forest and the good reconstruction ability of the Variational Autoencoder on normal data, proposes an anomaly detection algorithm that combines them (IForest-VAE). IForest detects and removes abnormal sliding windows, and the remaining normal data is used to optimize the Variational Autoencoder. Experimental results on real datasets show that this model achieves better evaluation results and is feasible in practice. (2) For AIOps fault location, the paper first analyzes the correlation between modules (segments) in the recharge business process and then studies the correlation between KPI curves. It applies EWMA (Exponentially Weighted Moving Average), wavelet analysis, and other methods to multiple KPI curves to extract abnormal features, gives a method for calculating the anomaly score, and identifies the abnormal segments and instances. A fault propagation chain is then constructed from the relationships between the modules and the KPI curves. Guided by the fault propagation chain, anomaly detection is run on the performance indicators (host, middleware, etc.) to find the location and cause of the fault. (3) Using the operator's top-up service as the setting, this paper designs and implements a minute-level root cause location system for complex operation and maintenance scenarios. I participated in and completed the implementation of fault discovery and fault location, including requirements analysis, database design, partial implementation of the algorithms in the AI module, and the design and development of the fault location module, monitoring center module, alarm module, and other functions. |
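
Three of the preprocessing and scoring steps named in the abstract (linear interpolation of missing values, Min-Max normalization, and an EWMA-based deviation score) can be illustrated with a minimal sketch. This is not the implementation described in the paper: the function names, the smoothing factor `alpha=0.3`, the sample data, and the choice of "absolute residual against the EWMA" as the score are all assumptions made for illustration.

```python
from typing import List, Optional


def fill_linear(values: List[Optional[float]]) -> List[float]:
    """Fill None gaps by linear interpolation between known neighbours.

    Leading and trailing gaps are filled with the nearest observed value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    out = list(values)
    # Edge gaps: copy the nearest observed value.
    for i in range(known[0]):
        out[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):
        out[i] = values[known[-1]]
    # Interior gaps: straight line between the two surrounding observations.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)
    return out


def min_max(values: List[float]) -> List[float]:
    """Scale values to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def ewma_scores(values: List[float], alpha: float = 0.3) -> List[float]:
    """Absolute deviation of each point from the EWMA of the points before it."""
    scores = [0.0]  # the first point has no history to compare against
    smoothed = values[0]
    for v in values[1:]:
        scores.append(abs(v - smoothed))  # residual against the running average
        smoothed = alpha * v + (1 - alpha) * smoothed  # update after scoring
    return scores


# Hypothetical per-minute success-rate samples with two missing points.
raw = [0.99, 0.98, None, 0.99, 0.98, 0.60, None, 0.99]
filled = fill_linear(raw)
scores = min_max(ewma_scores(filled))
suspect = max(range(len(scores)), key=scores.__getitem__)
print(filled)
print("most anomalous index:", suspect)
```

In this made-up series the sharp drop at index 5 receives the highest normalized score. Scoring each point before updating the running average keeps an outlier from masking itself, which is a design choice of this sketch and not something the abstract specifies.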