Reinforcement learning (RL) can optimize its control policy using real-time reward signals obtained from continuous interaction with the environment. This gives it strong self-adaptability and self-learning ability, and it has recently become a key method for autonomous robot navigation. This dissertation addresses three issues in the practical deployment of RL in unknown dynamic environments: failure recovery, learning safety, and re-learning efficiency. The main contents are as follows:

(1) For the failure recovery issue, a safe and self-recoverable reinforcement learning framework is proposed for autonomous navigation. The framework monitors the robot's action selection and prohibits unsafe behaviors through a safety guarantee module. It also provides a failure self-recovery method that endows the robot with the capability of recovering to past safe states, making the framework better suited to the practical deployment of RL. Experimental results show that this framework can recover from failures by itself and reduce the number of manual resets, while also incurring fewer failures and converging faster than traditional reinforcement learning methods.

(2) For the learning safety issue, a few-shot-reasoning-based safe reinforcement learning method is proposed for autonomous navigation. This method combines few-shot learning with reinforcement learning to reason about unknown obstacles and the related unsafe actions, improving safety during exploration. Meanwhile, the support set used in few-shot learning is dynamically managed to better fit the environment, so that the reasoning can be continuously improved. Experimental results show that the few-shot reasoning method based on a dynamic support set improves both the accuracy of obstacle recognition in navigation environments and the learning safety of reinforcement learning.

(3) For the re-learning efficiency issue, a fast reinforcement learning method with experience reuse is proposed for autonomous navigation. This method converts a dynamic change in the environment into the influence of obstacles on the original optimal policy, and uses the number of obstacles obstructing the original path to determine the specific local regions in which new exploration is needed to find a new local policy. The final updated policy combines the new local policy with the policy in the other regions, which remains the same as before. Experimental results show that this method significantly reduces the robot's exploration time in unimportant regions, thereby improving re-learning efficiency.
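To make contribution (1) concrete, the following is a minimal sketch of the general idea of combining action-level safety monitoring with self-recovery to recent safe states. The class name `SafeRecoverableAgent` and the hooks `is_action_safe` and `recover` are illustrative assumptions, not the dissertation's actual implementation.

```python
import random
from collections import deque

class SafeRecoverableAgent:
    """Sketch: filter unsafe actions and roll back to a recent safe state on failure."""

    def __init__(self, actions, history_len=20):
        self.actions = actions
        self.safe_states = deque(maxlen=history_len)  # recent states judged safe
        self.q = {}  # tabular Q-values: (state, action) -> value

    def is_action_safe(self, state, action):
        # Placeholder safety predicate (e.g., a predicted-collision check);
        # the safety guarantee module would implement this.
        return True

    def select_action(self, state, epsilon=0.1):
        # Keep only actions the safety module does not prohibit.
        allowed = [a for a in self.actions if self.is_action_safe(state, a)]
        if not allowed:
            return None  # no safe action available: caller should trigger recovery
        if random.random() < epsilon:
            return random.choice(allowed)
        return max(allowed, key=lambda a: self.q.get((state, a), 0.0))

    def remember_safe(self, state):
        self.safe_states.append(state)

    def recover(self):
        # Self-recovery: resume from the most recent safe state instead of a full reset.
        return self.safe_states[-1] if self.safe_states else None
```

In this reading, a manual reset is only needed when no safe state remains in the buffer, which is consistent with the reported reduction in the number of resets.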
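For contribution (2), the sketch below illustrates one way a dynamically managed support set could drive few-shot obstacle reasoning, using a nearest-prototype rule over embeddings. The class `DynamicSupportSet`, the class labels, and the recency-based update rule are assumptions for illustration only.

```python
import numpy as np

class DynamicSupportSet:
    """Sketch: prototype-based few-shot classification with a dynamically managed support set."""

    def __init__(self, max_per_class=10):
        self.support = {}          # class label -> list of embedding vectors
        self.max_per_class = max_per_class

    def add(self, label, embedding):
        # Dynamic management: keep only the most recent examples per class
        # so the support set tracks the current environment.
        self.support.setdefault(label, []).append(np.asarray(embedding, dtype=float))
        self.support[label] = self.support[label][-self.max_per_class:]

    def classify(self, embedding):
        # Nearest-prototype rule: compare the query to each class-mean embedding.
        query = np.asarray(embedding, dtype=float)
        best_label, best_dist = None, np.inf
        for label, examples in self.support.items():
            prototype = np.mean(examples, axis=0)
            dist = np.linalg.norm(query - prototype)
            if dist < best_dist:
                best_label, best_dist = label, dist
        return best_label, best_dist

# Example: reason about an unseen obstacle embedding before acting near it.
support = DynamicSupportSet()
support.add("static_obstacle", [0.9, 0.1])
support.add("dynamic_obstacle", [0.1, 0.9])
label, _ = support.classify([0.15, 0.85])   # -> "dynamic_obstacle"
```

Actions associated with the inferred obstacle class could then be marked unsafe during exploration, which is the role the reasoning module plays in the proposed method.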
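For contribution (3), the following sketch shows the experience-reuse idea under a simplified grid assumption: newly appeared obstacles are intersected with the stored optimal path, and only the obstructed local segments are marked for re-exploration while the rest of the policy is reused. The function name, grid coordinates, and `radius` parameter are hypothetical.

```python
def plan_relearning_regions(original_path, new_obstacles, radius=1):
    """Sketch: find path segments obstructed by new obstacles; only these need re-learning."""
    blocked = [i for i, cell in enumerate(original_path) if cell in new_obstacles]
    regions = []
    for i in blocked:
        start = max(0, i - radius)
        end = min(len(original_path) - 1, i + radius)
        if regions and start <= regions[-1][1] + 1:
            regions[-1] = (regions[-1][0], end)   # merge overlapping segments
        else:
            regions.append((start, end))
    return regions  # re-explore only these index ranges; reuse the old policy elsewhere

# Example: one new obstacle on the old path yields one small re-learning region.
path = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]
print(plan_relearning_regions(path, new_obstacles={(0, 2)}))  # [(1, 3)]
```

Restricting exploration to these regions is what keeps the robot from re-exploring unimportant parts of the environment, which underlies the reported gain in re-learning efficiency.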