| Fault delimitation is an important topic in intelligent operation and maintenance. A good fault delimitation method quickly and accurately detects when an anomaly occurs and pinpoints the faulty node and the faulty indicator of a specific service; it thereby guides subsequent emergency measures, helps site reliability engineers (SREs) recover from the failure in time, and reduces the loss caused by the anomaly. This paper therefore focuses on fault delimitation and, following the fault delimitation process, studies anomaly detection and root cause localization in depth. (1) For the monitoring indicators used in microservice operation and maintenance, time-series anomaly detection algorithms are studied in depth. In general, normal and abnormal labels for time series are difficult to obtain and prior knowledge is lacking; anomalies in time series take many forms; and system stability requires the algorithm to run in real time. This paper proposes a multi-dimensional, online, unsupervised anomaly detection algorithm based on the golden indicators of overall system stability. The algorithm uses kernel density estimation (KDE) to detect anomalies in latency and resource utilization, robust regression and a dynamic threshold to predict traffic and detect anomalies in it, respectively, and a binomial distribution to detect anomalies in the success rate; it finally outputs a weighted binary decision. Compared with four other typical algorithms (systems), the proposed algorithm performs better on real AIOps data, with higher accuracy and a lower false alarm rate, and it supports real-time detection. (2) To address the scale and complexity of microservice systems, root cause localization algorithms are studied in depth. When a KPI monitoring alarm fires, SREs often cannot perform fault delimitation and root cause localization immediately; a great deal of manual analysis is needed before the delimitation is complete. In this paper, a fault root cause localization algorithm
based on the microservice call chain is proposed: the root cause localization module uses the average anomaly degree of a node's neighbors as the node's anomaly infection ability and the maximum correlation coefficient between resource utilization and response time as the node's anomaly degree, and then performs a random walk to locate the faulty node; the root cause analysis module classifies the indicators of the service node by time-series type, calculates each indicator's degree of anomaly with the detection algorithm corresponding to its type, and combines it with the maximum correlation coefficient between the indicator and the front-end response time to output a weighted anomaly score. Compared with three other typical algorithms, the proposed algorithm finds the root cause more accurately on real AIOps data. |
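
The random-walk localization step can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the function names, the toy call graph, the precomputed anomaly degrees (standing in for the maximum correlation between resource utilization and response time), and the `alpha` blend of a node's own anomaly degree with its infection ability.

```python
import random
from collections import Counter


def neighbors(graph, node):
    """Return the callers and callees of `node` in a caller -> callees mapping."""
    callees = set(graph.get(node, ()))
    callers = {src for src, dsts in graph.items() if node in dsts}
    return (callees | callers) - {node}


def rank_root_causes(graph, anomaly, steps=20000, alpha=0.5, seed=0):
    """Rank services by how often a biased random walk visits them."""
    rng = random.Random(seed)
    nodes = sorted(set(graph) | {d for dsts in graph.values() for d in dsts})

    # Assumed "infection ability": mean anomaly degree of a node's neighbors.
    infection = {}
    for n in nodes:
        nbrs = neighbors(graph, n)
        infection[n] = sum(anomaly[m] for m in nbrs) / len(nbrs) if nbrs else 0.0

    # Assumed transition score: blend of own anomaly degree and infection ability.
    score = {n: alpha * anomaly[n] + (1 - alpha) * infection[n] for n in nodes}

    visits = Counter()
    current = nodes[0]
    for _ in range(steps):
        # The walker may stay where it is or move to any caller or callee.
        candidates = [current] + sorted(neighbors(graph, current))
        # A tiny offset keeps the weights valid when every score is zero.
        weights = [score[c] + 1e-9 for c in candidates]
        current = rng.choices(candidates, weights=weights, k=1)[0]
        visits[current] += 1
    return [n for n, _ in visits.most_common()]


# Hypothetical call graph (caller -> callees) and precomputed anomaly degrees.
call_graph = {"frontend": ["cart", "search"], "cart": ["db"], "search": ["db"]}
anomaly_degree = {"frontend": 0.2, "cart": 0.3, "search": 0.1, "db": 0.9}

ranking = rank_root_causes(call_graph, anomaly_degree)
print(ranking)
```

In this toy setup the walker spends most of its time on the node with the highest blended score, so the head of `ranking` is the suggested root-cause candidate; a real system would derive the anomaly degrees from monitored indicators rather than hard-code them.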