Research On Proactive Prediction Method Of Job Failure For Supercomputers

Posted on: 2024-09-11
Degree: Master
Type: Thesis
Country: China
Candidate: G Xian
GTID: 2558307073968579
Subject: Software engineering
Abstract/Summary:
High-performance computing is now widely applied in scientific research, engineering, and other fields, providing the floating-point performance needed to solve practical problems such as ultra-large-scale, high-precision simulation, and this trend is expected to strengthen. As demand for capability computing and capacity computing continues to grow, supercomputing centers are investing more and more in high-performance hardware, updating or expanding equipment to improve platform service capabilities. At the same time, improving resource utilization and reducing operating costs has become a key challenge for the stable and efficient operation of supercomputing systems and applications.

The computing power of supercomputers has entered the exaFLOPS level, and their computing, networking, I/O, and storage subsystems are becoming increasingly complex. Reliability is considered the third major challenge after parallelism and energy efficiency management. Any measure aimed at improving reliability must cope with failures that may occur at any time. A significant amount of research therefore focuses on predicting faults so that mitigating measures can be taken before they occur. Because mitigation is applied while the system is still functioning normally, this approach makes it easier to avoid the high cost of data recovery and storage required after a failure.

This thesis addresses the large number of job failures on current supercomputers, which lead to longer waiting times for queued jobs, wasted resources, and other issues. By analyzing and mining features related to job status from job logs and other monitoring information, and by applying machine learning algorithms, a job failure prediction method is implemented that improves prediction accuracy. This allows operations and maintenance staff to respond promptly to jobs that are likely to fail and improves system resource utilization. The thesis focuses on the design of proactive prediction methods for job failures on supercomputers and solves a series of key technical problems. The specific research content is as follows:

(1) Large-scale Slurm job logs and other monitoring data are explored, the semantic composition of jobs is analyzed in depth, and job application types are defined and refined based on job names and submission paths. On this foundation, job feature attributes are explored further, and two proactive prediction methods for job failures are implemented: static prediction and dynamic prediction. Both methods take the characteristics of supercomputers into account, use data mining techniques and feature engineering, and establish multiple job failure prediction models to adapt to different job failure situations, ultimately improving prediction accuracy and stability.

(2) A static prediction method for job failure is designed. Once job application types have been determined, a clustering strategy for job names and submission paths is proposed, and users' repeated submission behavior is explored. Both job submission paths and user submission behavior are used as new input features for prediction. Static prediction includes coarse-grained and fine-grained prediction, and the most effective algorithm model is selected during prediction to adapt to users' actual job situations. Experimental results show that this method achieves an accuracy of 89.08%, a significant improvement over traditional feature-based prediction.

(3) A dynamic prediction method for job failure is designed. This method establishes a model training time window and a prediction time window, and exploits the correlation between jobs of the same application type within a short period of time by continuously updating the training set. Building on the static prediction method, a clustering strategy for job names and submission paths suited to dynamic prediction is proposed. This strategy clusters according to the character priority of the semantic composition and defines corresponding rules for determining the sequence correlation of job applications. The most effective algorithm model is selected dynamically for each time window. Experimental results show that accuracy can reach 89.78% and that overall prediction performance improves, especially in terms of specificity and sensitivity.
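
The dynamic method in (3) can be illustrated with a small sliding-window sketch. The abstract does not state the window lengths, the feature set, or the candidate algorithms, so everything concrete below is an assumption made for illustration: the `Job` fields, the two toy predictors (`fit_majority`, `fit_per_app`), the held-out quarter used for model selection, and the choice of "failed" as the positive class for sensitivity. It is not the thesis implementation; it only shows the general pattern of retraining on a moving window, picking the better model per window, and predicting the next window.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Job:
    submit_time: int  # hypothetical: seconds since the start of the trace
    app_type: str     # hypothetical: application type from job name/path clustering
    failed: bool      # ground-truth final state of the job


def fit_majority(train):
    """Toy model: always predict the most common outcome in the window."""
    label = Counter(j.failed for j in train).most_common(1)[0][0] if train else False
    return lambda job: label


def fit_per_app(train):
    """Toy model: predict failure if most window jobs of the same app type failed."""
    counts = defaultdict(lambda: [0, 0])  # app_type -> [succeeded, failed]
    for j in train:
        counts[j.app_type][int(j.failed)] += 1
    fallback = fit_majority(train)

    def predict(job):
        if job.app_type not in counts:  # unseen app type: use the window majority
            return fallback(job)
        ok, bad = counts[job.app_type]
        return bad > ok

    return predict


def accuracy(model, jobs):
    return sum(model(j) == j.failed for j in jobs) / len(jobs) if jobs else 0.0


def rolling_predict(jobs, train_span, predict_span, fitters):
    """Slide a training window and a prediction window over the trace."""
    jobs = sorted(jobs, key=lambda j: j.submit_time)
    results = []
    if not jobs:
        return results
    t = jobs[0].submit_time + train_span
    while t <= jobs[-1].submit_time:
        train = [j for j in jobs if t - train_span <= j.submit_time < t]
        target = [j for j in jobs if t <= j.submit_time < t + predict_span]
        if train and target:
            # Hold out the newest quarter of the window to choose a model.
            cut = max(1, len(train) * 3 // 4)
            fit_part, val_part = train[:cut], train[cut:] or train[:cut]
            best = max(fitters, key=lambda f: accuracy(f(fit_part), val_part))
            model = best(train)  # refit the chosen model on the full window
            results.extend((j, model(j)) for j in target)
        t += predict_span  # the training set is refreshed at every step
    return results


def sensitivity_specificity(results):
    """Sensitivity and specificity, treating 'failed' as the positive class."""
    tp = sum(1 for j, p in results if j.failed and p)
    fn = sum(1 for j, p in results if j.failed and not p)
    tn = sum(1 for j, p in results if not j.failed and not p)
    fp = sum(1 for j, p in results if not j.failed and p)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec
```

Calling `rolling_predict(jobs, train_span, predict_span, [fit_majority, fit_per_app])` returns `(job, predicted_failed)` pairs that can be passed to `sensitivity_specificity`. In practice the toy predictors would be replaced by real classifiers trained on the engineered features, and the window lengths would be tuned on the actual job trace.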
Keywords: Reliability, User behavior, Job application relevance, Job failure prediction, Machine learning