Owing to its on-demand service model, elasticity, and high availability, cloud computing has been widely adopted across many fields in recent years, and the scale of services running on cloud platforms has grown rapidly. At the same time, cloud computing environments are becoming more complex and diverse, and the difficulty and cost of operation and maintenance are rising, which makes cluster monitoring and administration a demanding task. To improve cloud platform stability, lower operation and maintenance costs, and ensure service quality, and to meet the monitoring needs of the research group's cloud encryption system key server cluster, this paper designs and implements a monitoring and alerting system for computing clusters. Based on a study of cluster monitoring technology and the practical requirements of a monitoring and alerting system, the system architecture is designed and implemented. It comprises two models (monitoring metrics and alert rules) and four functional modules (data scraping, alerting, data storage, and system management). To automate monitoring and scraping, the data scraping module introduces a service registry to manage monitoring targets, which reduces monitoring and maintenance costs. On the scraping agent side, a basic resource monitoring agent collects virtual machine hardware information, and an agent client library monitors application business metrics, which greatly simplifies integrating an application with the monitoring system. Because alert evaluation must be timely, a local cache module stores time series data; an inverted index and a data compression algorithm keep the cache efficient in both access time and space, which supports real-time alert evaluation. The alerting module evaluates alerts against the local cache, uses Redis backup queues and blocking commands to ensure that alert messages are delivered reliably, and supports several notification channels, including email, SMS, and DingTalk (dingding). The system management module provides an intuitive, easy-to-use web interface for one-stop management. After functional and non-functional testing, the monitoring and alerting system for computing clusters in the cloud computing environment was completed. It provides monitoring and alerting for clusters and services and lays a solid foundation for the stable operation of clusters and services on the cloud platform.
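As an illustration of how an inverted index can keep label lookups in a local time series cache fast, the following minimal Python sketch may help. It is not the system's actual implementation: the class `LocalSeriesCache`, the helper `firing`, and the label names `metric` and `instance` are all hypothetical, and the data compression step mentioned above is omitted for brevity.

```python
from collections import defaultdict


class LocalSeriesCache:
    """Toy in-memory time series cache with an inverted index on labels.

    Hypothetical sketch: names and layout are assumptions, not the paper's design.
    """

    def __init__(self):
        self._series_ids = {}              # frozen label set -> series id
        self._postings = defaultdict(set)  # (label name, value) -> series ids
        self._samples = defaultdict(list)  # series id -> [(timestamp, value)]

    def append(self, labels, timestamp, value):
        """Store one sample, registering the series in the index if it is new."""
        key = frozenset(labels.items())
        sid = self._series_ids.get(key)
        if sid is None:
            sid = len(self._series_ids)
            self._series_ids[key] = sid
            # Inverted index: each label pair points at the series carrying it.
            for pair in labels.items():
                self._postings[pair].add(sid)
        self._samples[sid].append((timestamp, value))
        return sid

    def select(self, **matchers):
        """Return ids of series whose labels equal every name=value matcher."""
        if not matchers:
            return set(self._series_ids.values())
        postings = [self._postings.get(pair, set()) for pair in matchers.items()]
        # Intersecting posting sets avoids scanning every stored series.
        return set.intersection(*postings)

    def latest(self, sid):
        """Return the most recent (timestamp, value) sample of a series."""
        return self._samples[sid][-1]


def firing(cache, threshold, **matchers):
    """Return sorted ids of matching series whose latest value exceeds threshold."""
    return sorted(
        sid for sid in cache.select(**matchers)
        if cache.latest(sid)[1] > threshold
    )


cache = LocalSeriesCache()
a = cache.append({"metric": "cpu_usage", "instance": "vm-1"}, 1000, 0.42)
b = cache.append({"metric": "cpu_usage", "instance": "vm-2"}, 1000, 0.97)
c = cache.append({"metric": "mem_usage", "instance": "vm-1"}, 1000, 0.99)
cache.append({"metric": "cpu_usage", "instance": "vm-1"}, 1015, 0.95)

print(firing(cache, 0.9, metric="cpu_usage"))
```

In this sketch an alert rule is reduced to a threshold on the latest sample. Evaluating it touches only the series returned by the index lookup, which is the property that makes judging alerts from a local cache inexpensive.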