
Research On Safe Reinforcement Learning Guided By PertSTL* Online Monitor

Posted on: 2024-07-17
Degree: Master
Type: Thesis
Country: China
Candidate: J N Chen
Full Text: PDF
GTID: 2558307067493384
Subject: Software engineering
Abstract/Summary:
Cyber-physical systems (CPS) are widely used in autonomous driving, aerospace, and other safety-critical domains. Existing rule-based CPS controllers suffer from poor scalability, rely on manual design by domain experts, and fail to adapt to unknown environments. Controllers based on Reinforcement Learning (RL), on the other hand, handle high-dimensional states and uncertain environments well, but they ignore the losses and costs incurred during learning and therefore cannot effectively guarantee the safety of their decisions. In addition, slight sensor disturbances in real physical systems and the uncertainty of unknown environments pose new challenges for the formal verification of CPS. To address the safety issues of reinforcement learning controllers, we propose a safe reinforcement learning method guided by a PertSTL* online monitor. First, we extend Perturbed Signal Temporal Logic (PertSTL) and propose an efficient online monitoring algorithm that analyzes the impact of random signal perturbations on safety verification results. We then use PertSTL* to describe the safety requirements that the RL agent must obey during exploration and monitor them in real time. Based on the quantitative evaluation results produced by the online monitor, we reshape the reward function and improve the experience replay mechanism in deep reinforcement learning (DRL), which improves both the learning efficiency of the agent and the credibility of its decisions. Experimental results show that, compared with traditional DRL algorithms, the proposed method achieves significant improvements in convergence speed and safety.

The main contributions of this paper are:

1. We define PertSTL* and present an efficient online monitoring algorithm for it. We first define the syntax, Boolean semantics, and robust semantics of PertSTL*, and analyze how the worst-case perturbation is computed under each semantics. For the robust semantics, we propose an efficient online monitoring algorithm that relies on the ANTLR tool to parse a PertSTL* formula and construct its abstract syntax tree, and then recursively computes the quantitative evaluation results bottom-up over that tree. A modified maximum-minimum filtering algorithm and the notion of expiration time reduce redundant computation in the recursion and significantly improve efficiency.

2. We propose a safe reinforcement learning method guided by the PertSTL* online monitor, which combines runtime verification with reinforcement learning. The method first converts the safety constraints described in PertSTL* into optimization objectives for the agent's learning process, and then uses the robust semantics of PertSTL* to improve the RL algorithm in two ways. On the one hand, the reward reshaping algorithm is modified based on the robustness value, so that the reward function is optimized within the safe region. On the other hand, the experience replay mechanism is enhanced by evaluating the robustness of historical experience, so that the agent preferentially learns from experiences that satisfy the PertSTL* safety constraints during training. Finally, taking the Deep Deterministic Policy Gradient (DDPG) algorithm as an example, we explain in detail how a traditional reinforcement learning algorithm can be improved with the PertSTL* online monitoring mechanism; a minimal sketch of this idea follows.
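As an illustration of the second contribution only, the Python sketch below shows how a robustness value reported by an online monitor could reshape the reward and set replay priorities in a DDPG-style loop. The names (replay_buffer, LAMBDA, store_transition, sample_batch) and the exact shaping and priority formulas are illustrative assumptions, not the thesis' actual implementation.

```python
import numpy as np

# Hypothetical sketch: robustness-guided reward reshaping and replay
# prioritization. All names and formulas here are assumptions for
# illustration; the thesis' method may differ in detail.

LAMBDA = 0.5          # weight of the robustness term in the shaped reward
replay_buffer = []    # entries: (priority, transition)

def store_transition(state, action, env_reward, next_state, done, rho):
    """rho: quantitative robustness of the PertSTL* safety constraint,
    as reported online by the monitor for the step just taken."""
    # Reward reshaping: add the robustness value so that behaviour which
    # keeps the constraint satisfied (rho > 0) is preferred and the
    # optimum stays within the safe region.
    reward = env_reward + LAMBDA * rho
    # Robustness-aware replay: experiences that satisfy the constraint
    # receive higher priority and are replayed more often.
    priority = 1e-3 + max(rho, 0.0)
    replay_buffer.append((priority, (state, action, reward, next_state, done)))

def sample_batch(batch_size):
    """Priority-proportional sampling over the replay buffer (simplified)."""
    priorities = np.array([p for p, _ in replay_buffer])
    probs = priorities / priorities.sum()
    idx = np.random.choice(len(replay_buffer), size=batch_size, p=probs)
    return [replay_buffer[i][1] for i in idx]
```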
3. Taking the vehicle-following scenario in autonomous driving as an example, we validate the proposed method. We first introduce the simulation environment and the parameter settings of the DDPG algorithm used in the experiments. We then describe the vehicle-following scenario and the safety constraints that the autonomous vehicle must satisfy in it; a sketch of how such a constraint can be monitored online follows this paragraph. We analyze the impact of perturbation signals on the safety verification results and, through comparative experiments, compare the cumulative reward and convergence speed of the proposed method with those of traditional DDPG. Finally, to explore the impact of random disturbance signals on the reinforcement learning algorithm, we apply different levels of perturbation to the sensors and observe their effect on the experimental results. The results show that, compared with the traditional DDPG algorithm, the proposed method achieves significant improvements in convergence speed, decision-making safety, and robustness to random signal interference.
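As a minimal sketch of the first contribution applied to this scenario, the code below monitors a single bounded-"always" constraint of the form G_[0,w](distance - d_safe >= 0) under a worst-case sensor perturbation bound, using a monotonic deque (the max-min filtering idea) and an expiration time so that each sample is processed at most twice. The class name, the single-predicate restriction, and the concrete window/threshold/perturbation values are illustrative assumptions; the full PertSTL* monitor in the thesis parses arbitrary formulas with ANTLR and evaluates their syntax trees bottom-up.

```python
from collections import deque

class OnlineAlwaysMonitor:
    """Simplified online robustness monitor for G_[0, w](x(t) - c >= 0)
    under a worst-case perturbation bound eps on the signal x.
    Illustrative sketch only, restricted to one atomic predicate."""

    def __init__(self, window, threshold, eps):
        self.window = window        # time horizon w of the 'always' operator
        self.threshold = threshold  # constant c in the predicate x - c >= 0
        self.eps = eps              # worst-case perturbation bound on x
        self.buf = deque()          # (timestamp, robustness), min kept at front

    def update(self, t, x):
        # Worst-case robustness of the atom under a perturbation |d| <= eps.
        rho = (x - self.eps) - self.threshold
        # Expiration time: drop samples that fell out of the time window.
        while self.buf and self.buf[0][0] <= t - self.window:
            self.buf.popleft()
        # Keep the deque non-decreasing so its head is the window minimum.
        while self.buf and self.buf[-1][1] >= rho:
            self.buf.pop()
        self.buf.append((t, rho))
        # Robustness of the 'always' formula = minimum over the window.
        return self.buf[0][1]


# Example: "the following distance stays above 5 m over the last 2 s",
# with an assumed 0.1 m sensor perturbation bound.
monitor = OnlineAlwaysMonitor(window=2.0, threshold=5.0, eps=0.1)
for t, d in [(0.0, 7.2), (0.5, 6.8), (1.0, 5.4), (1.5, 5.1)]:
    print(t, monitor.update(t, d))
```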
Keywords/Search Tags:Signal Temporal Logic, Online Monitor, Runtime Verification, Reinforcement Learning