| In recent years, more and more developers have started to build applications on cloud-native architectures, and the “cloud-enabled” mechanism has become a key direction in the digital transformation of enterprises. As systems scale up, they generate multiple time series and run in dynamic operating environments, so traditional experience-based manual monitoring can no longer meet the requirements of IT operation and maintenance. In this context, AIOps has emerged, aiming to achieve efficient and low-cost IT operation using artificial intelligence technology. AIOps involves two key scenarios: anomaly detection and root cause analysis. Anomaly detection identifies abnormal system behavior by analyzing the intrinsic characteristics of monitoring metrics. Root cause analysis locates the root causes of system anomalies based on fault propagation graphs. Scholars have recently proposed many anomaly detection and root cause localization methods based on system monitoring metrics. These works have achieved good results but still have a few limitations. In anomaly detection, the impact of correlations among multiple metrics on detection results has not been explored in depth. In root cause analysis, existing methods require manual parameter tuning for each system. To address these issues, the main work of this thesis includes: (1) An anomaly detection method that combines an attention-based prediction model with iForest is proposed. First, we combine a feature attention mechanism and a temporal attention mechanism to extract potentially key information from the time series data. On this basis, we build a sequence-to-sequence prediction model and obtain the prediction residuals by comparing the predicted and true values of the time series. Finally, we use the prediction residuals as the input of the iForest algorithm, which dynamically adjusts the anomaly threshold according to the characteristics of each dataset to achieve anomaly detection. Experiments on two datasets show that the proposed algorithm outperforms the classical anomaly detection methods. (2) For root cause analysis, an automatic ranking method for anomalous microservices based on a random walk algorithm is proposed. First, system-level and application-level metrics are collected to construct a service dependency graph for the cloud-native system. Next, the historical response time metrics are clustered to obtain the initial anomaly weight of each microservice node. The anomaly weights in the service dependency graph are then updated automatically according to the anomaly propagation relationship between each microservice node and its neighboring nodes. Finally, the personalized PageRank random walk algorithm is used to rank the anomalous microservices. Experiments in cloud-native environments show that the proposed root cause analysis method locates the root cause of anomalies efficiently while remaining robust as cloud-native systems scale. |
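
The final ranking step in (2) can be illustrated with a minimal sketch of personalized PageRank over a service dependency graph. This is not the thesis's implementation. The service names, the edge direction (caller to callee), the damping factor, and the anomaly weights below are all hypothetical, and the weights are supplied directly rather than derived by clustering response times or by the automatic update step.

```python
from typing import Dict, List


def personalized_pagerank(
    graph: Dict[str, List[str]],
    weights: Dict[str, float],
    damping: float = 0.85,
    iterations: int = 200,
    tol: float = 1e-12,
) -> Dict[str, float]:
    """Rank nodes by a random walk that restarts according to `weights`.

    graph maps each service to the services it calls (assumed direction).
    Every callee must also appear as a key in `graph`.
    """
    nodes = sorted(graph)
    total = sum(weights.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("at least one anomaly weight must be positive")

    # Normalized restart (preference) distribution from the anomaly weights.
    pref = {n: weights.get(n, 0.0) / total for n in nodes}
    rank = dict(pref)

    for _ in range(iterations):
        # With probability (1 - damping) the walk restarts from `pref`.
        nxt = {n: (1.0 - damping) * pref[n] for n in nodes}
        for n in nodes:
            callees = graph[n]
            if callees:
                # Spread this node's score evenly over its callees.
                share = damping * rank[n] / len(callees)
                for m in callees:
                    nxt[m] += share
            else:
                # A node with no outgoing edges returns its score via `pref`.
                for m in nodes:
                    nxt[m] += damping * rank[n] * pref[m]
        delta = sum(abs(nxt[n] - rank[n]) for n in nodes)
        rank = nxt
        if delta < tol:
            break
    return rank


# Hypothetical four-service system; weights stand in for anomaly scores.
graph = {
    "frontend": ["cart", "catalog"],
    "cart": ["db"],
    "catalog": ["db"],
    "db": [],
}
weights = {"frontend": 0.2, "cart": 0.1, "catalog": 0.1, "db": 0.6}

scores = personalized_pagerank(graph, weights)
for service in sorted(scores, key=scores.get, reverse=True):
    print(f"{service}: {scores[service]:.3f}")
```

In this sketch the restart distribution carries the anomaly evidence, so services that are both anomalous themselves and reachable from other anomalous services accumulate the highest scores.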