The goal of reinforcement learning is to maximize cumulative extrinsic reward. Reward is the source of motivation for improving a reinforcement learning policy, but most tasks do not provide ideal dense extrinsic rewards. Exploration-oriented reinforcement learning and hierarchical reinforcement learning are commonly used to solve tasks with sparse extrinsic rewards, yet both have shortcomings. Reinforcement learning methods that rely on intrinsic motivation for exploration often compute the intrinsic reward through an overly complicated process, and most of them ignore the role a state plays within its own episode. Goal-based hierarchical reinforcement learning methods select goals blindly and lack guidance. To better address sparse extrinsic rewards, this thesis studies these specific problems and makes the following contributions:

(1) To address the complicated intrinsic-reward computation and the neglected role of a state within its episode, this thesis makes full use of that role and designs an intrinsic reward function with a relatively simple computation. The method does not need to estimate the agent's familiarity with states. Instead, it computes an intrinsic reward from the distances between the next state and the historical states of the same episode, and uses this reward to push the agent away from recently visited regions while also preventing it from looping (a minimal sketch of this episodic-distance bonus is given after the abstract). Experiments were conducted in discrete environments with sparse extrinsic rewards. The results show that the intrinsic reward function effectively improves the agent's exploration ability and thus solves sparse-reward tasks efficiently.

(2) To address blind, unguided goal selection in goal-based hierarchical reinforcement learning, this thesis proposes a goal selection method. The method quantifies the agent's mastery of a goal as the number of times that goal has been achieved: the higher the success count, the better the goal has been mastered. The selection probability of goals or trajectories with low success counts is appropriately increased, so that the policy focuses on goals that have not yet been mastered while still reviewing mastered goals in time to avoid forgetting. The method was applied to virtual goal selection in the hindsight experience replay algorithm, and experiments were conducted in continuous environments with sparse extrinsic rewards (a sketch of success-count-based goal sampling is also given below). The results show that the proposed method improves hindsight experience replay by providing it with more reasonable virtual goals.
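The following is a minimal sketch of an episodic-distance intrinsic reward as described in contribution (1). It assumes states are continuous vectors and uses Euclidean distance with a nearest-neighbor aggregation; the function name, the `scale` coefficient, and the choice of minimum distance are illustrative assumptions, not necessarily the exact formulation used in the thesis.

```python
import numpy as np

def episodic_distance_bonus(next_state, episode_states, scale=1.0):
    """Intrinsic reward from distances between the next state and the
    states already visited in the current episode.

    A larger distance to the previously visited states means the agent has
    moved away from the recently visited region, so the bonus is larger;
    revisiting a state (distance near zero) yields almost no bonus, which
    discourages looping.

    next_state:     1-D array, the state reached after the action.
    episode_states: list of 1-D arrays, states visited earlier in this episode.
    scale:          hypothetical coefficient weighting the bonus.
    """
    if not episode_states:
        return 0.0
    dists = [np.linalg.norm(next_state - s) for s in episode_states]
    # Aggregate with the distance to the nearest previously visited state
    # (an illustrative choice; other aggregations are possible).
    return scale * float(min(dists))
```

In training, such a bonus would be added to the sparse extrinsic reward, for example `r = r_ext + episodic_distance_bonus(next_state, episode_states)`, with the list of episode states reset at the start of every episode.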
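Below is a sketch of success-count-based goal sampling as described in contribution (2), intended for choosing virtual goals in hindsight experience replay. It assumes goals can be represented as hashable keys; the class name, the `smoothing` parameter, and the inverse-count weighting are illustrative assumptions that merely reproduce the stated idea of raising the selection probability of goals with low success counts.

```python
import numpy as np
from collections import defaultdict

class SuccessCountGoalSampler:
    """Samples goals with probability inversely related to how often they
    have already been achieved, so poorly mastered goals are selected more
    often while mastered ones are still revisited to avoid forgetting."""

    def __init__(self, smoothing=1.0):
        self.success_counts = defaultdict(int)
        self.smoothing = smoothing  # keeps weights finite for unseen goals

    def record_success(self, goal_key):
        """Call whenever the agent achieves the goal `goal_key`."""
        self.success_counts[goal_key] += 1

    def sample(self, candidate_goals, rng=np.random):
        """Pick one goal from `candidate_goals` (e.g. achieved states of a
        trajectory, reused as virtual goals for hindsight experience replay).
        Goals with low success counts receive higher selection probability."""
        weights = np.array([
            1.0 / (self.success_counts[g] + self.smoothing)
            for g in candidate_goals
        ])
        probs = weights / weights.sum()
        idx = rng.choice(len(candidate_goals), p=probs)
        return candidate_goals[idx]
```

The inverse-count weighting is only one way to realize the stated rule; any monotonically decreasing function of the success count would similarly bias sampling toward goals the policy has not yet mastered.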