
Research And Implementation Of Fault Location Method For Distributed Networked Systems

Posted on: 2023-12-23
Degree: Master
Type: Thesis
Country: China
Candidate: X M Lv
Full Text: PDF
GTID: 2568306914481654
Subject: Electronic and communication engineering
Abstract/Summary:
With the growth of cloud computing and edge computing, distributed information network architectures that carry micro-services and new-generation communication networks have been widely adopted in the Internet, telecommunications, and other fields. Distributed information network systems offer high reliability, strong scalability, resource sharing, and high performance, and they have become essential infrastructure for current information and communication systems. However, the complex topology of a distributed system and the strong correlation between its nodes pose challenges to effective operation and maintenance. When system performance deteriorates or a fault occurs, locating the root cause of the fault in a complex network topology becomes an important requirement for operators. Locating faults based on operator experience has great limitations and is not suited to the current application scenarios of complex distributed systems. Artificial Intelligence for IT Operations (AIOps) provides an effective means of locating the root cause of faults in distributed systems and has become a research hotspot. However, because the network topology is complex, faults frequently propagate through the system. Existing methods ignore the influence of fault propagation between nodes on fault location and lack an explanation of the propagation behavior. In addition, because fault mechanisms and data differ between scenarios, a method that focuses on a single scenario can hardly ensure flexibility and reusability. This thesis studies and implements fault location methods for distributed networked systems. The main research results are as follows:

1. To address the interpretability of fault propagation in different scenarios, this thesis proposes a fault location method called Root Cause Analysis based on Weighted Fault Propagation Graph (WFPG-RCA). First, a fault propagation graph is constructed from the real network topology and the characteristics of the operation and maintenance data during faults. Root cause information is then mined from the operation and maintenance data, the nodes and directed edges of the graph are assigned weights with physical meanings, and the root cause is determined by a root influence ranking algorithm. Finally, the fault propagation behavior is explained based on the location results and the edge weights in the propagation graph. Experimental results show that the location accuracy of the WFPG-RCA method is superior to that of the baseline methods on datasets from micro-service, e-commerce platform, and bearer network scenarios, and that the method is interpretable as well as accurate.

2. To address the problem that existing intelligent operation and maintenance platforms lack a variety of fault location methods and model understanding functions, this thesis designs and implements a fault location learning component platform for distributed network systems. In addition to basic data management functions, the platform introduces a variety of fault location algorithms to meet operators' requirements for fault location in different scenarios. Following the idea of learning components, the platform also exposes the adjustable parameters of each fault location algorithm and displays the results under different parameters as charts and reports, which improves the user's understanding of the learning components. In addition, a model library makes it convenient for operation and maintenance personnel to manage and optimize models, which gives the models the ability to evolve. Finally, the thesis verifies the stability of the learning component platform using designed test samples and actual usage experience, in order to ensure that the platform works normally.
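The graph-based ranking idea behind the first contribution can be illustrated with a small sketch. The abstract does not specify how the root influence ranking algorithm works, so the code below swaps in a personalized PageRank-style random walk over reversed propagation edges as a stand-in; it is not the thesis's actual WFPG-RCA implementation. The function name `rank_root_causes`, the node names, and all weights are hypothetical, and the sketch assumes every node in an edge also appears in `node_anomaly` with a positive total anomaly score.

```python
from collections import defaultdict


def rank_root_causes(edges, node_anomaly, damping=0.85, iterations=200):
    """Rank nodes as root-cause candidates on a weighted fault propagation graph.

    edges: iterable of (cause, effect, weight); a fault at `cause` may spread
        to `effect`, and `weight` is the assumed propagation strength.
    node_anomaly: dict mapping node -> non-negative anomaly score (node weight).
    """
    nodes = sorted(node_anomaly)
    total = sum(node_anomaly.values())
    prior = {n: node_anomaly[n] / total for n in nodes}

    # Reverse the edges so the walk moves from an affected node toward its causes.
    upstream = defaultdict(list)
    for cause, effect, weight in edges:
        upstream[effect].append((cause, weight))

    score = dict(prior)
    for _ in range(iterations):
        # Restart term: jump back to a node in proportion to its anomaly score.
        nxt = {n: (1.0 - damping) * prior[n] for n in nodes}
        for n in nodes:
            out_weight = sum(w for _, w in upstream[n])
            if out_weight == 0:
                # No upstream cause: the walk stays here, so origins accumulate score.
                nxt[n] += damping * score[n]
                continue
            for cause, w in upstream[n]:
                nxt[cause] += damping * score[n] * w / out_weight
        score = nxt
    return sorted(score.items(), key=lambda item: item[1], reverse=True)


# Hypothetical micro-service topology: a database fault spreads to two
# services and then to the gateway.
edges = [
    ("db", "order", 0.8),
    ("db", "cart", 0.7),
    ("order", "gateway", 0.6),
    ("cart", "gateway", 0.5),
]
node_anomaly = {"db": 0.9, "order": 0.6, "cart": 0.5, "gateway": 0.7}

ranking = rank_root_causes(edges, node_anomaly)
print(ranking[0][0])
```

In this toy example `db` ranks first, because every backward walk from an affected node ends there. The edge weights along the path from the top-ranked node to the affected nodes are what an explanation of the propagation behavior would draw on.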
Keywords/Search Tags: distributed system, AIOps, fault location, learning components platform