Deep reinforcement learning combines the representational power of deep neural networks with the decision-making capability of reinforcement learning, and it has been widely applied across many fields with good performance. However, deep reinforcement learning struggles to achieve good performance in more complex task scenarios. Deep hierarchical reinforcement learning adopts the idea of divide and conquer: by decomposing a large-scale task into small-scale subtasks and solving them separately, it can effectively alleviate the "curse of dimensionality" and the sparse-reward problem, both of which are difficult for traditional reinforcement learning to handle. The Option-Critic framework is a mainstream framework in deep hierarchical reinforcement learning research; through the policy gradient theorem it achieves end-to-end learning of intra-option policies and termination functions. However, the Option-Critic framework suffers from degradation during policy learning, such as options in the option set becoming similar to one another, low knowledge-transfer ability of the lower-level policies, and limited exploration ability of the agent. To address these problems, this work studies option diversity, policy transfer, and guaranteed exploration through optimized clipping parameters, covering the following three aspects:

i. During policy learning with the Option-Critic framework, the options in the option set tend to become similar. To address this problem, an Option-Critic Algorithm with Mutual Information Optimization (MIOC) is proposed. MIOC introduces the mutual information between options and actions as an intrinsic reward, which encourages different options to take different actions in the same state and thereby ensures diversity among options (see the first sketch below). Comparative experiments in several continuous environments verify that the method preserves option diversity and improves performance.

ii. Relying on an intrinsic drive to guarantee option diversity can slow learning and yield policies with low knowledge transferability. To address this problem, a Diversity-Enriched Option-Critic Algorithm with Interest Function Optimization (DEOC-IF) is proposed. By introducing interest functions that restrict which lower-level policies the upper-level policy may select (see the second sketch below), DEOC-IF not only ensures the diversity of the option set but also lets the learned intra-option policies focus on different regions of the state space, which improves the knowledge-transfer ability of the algorithm and accelerates learning. Experimental results show that the algorithm is effective.

iii. A fixed clipping parameter can leave the agent with insufficient exploration in the early stage of policy training and degrade the experimental results. To address this problem, a Proximal Policy Option-Critic Algorithm Based on an Optimized Clipping Parameter (OCP) is proposed. OCP introduces two decaying forms of the clipping parameter to constrain updates of the lower-level policy (see the third sketch below), giving the agent sufficient exploration ability at the start of training while keeping policy updates stable at the end. Comparative experiments in a continuous environment show that the algorithm learns faster and achieves better performance.
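To make the idea in (i) concrete, the following minimal sketch shows one way a mutual-information-style intrinsic reward between the active option and the chosen action could be computed. The function name `mutual_info_bonus`, the weighting coefficient `beta`, and the uniform marginal over options are illustrative assumptions, not the exact formulation used by MIOC.

```python
import numpy as np

def mutual_info_bonus(action_probs_per_option, option, action):
    """Pointwise mutual-information term between the active option and the
    chosen action: log pi(a|s,o) - log pi_bar(a|s), where pi_bar marginalises
    over options.  The bonus is large when an option prefers actions that the
    other options avoid, which rewards diverse option behaviour."""
    p_a_given_o = action_probs_per_option[option, action]
    p_a_marginal = action_probs_per_option[:, action].mean()  # assumes a uniform prior over options
    return np.log(p_a_given_o + 1e-8) - np.log(p_a_marginal + 1e-8)

# Two options over three discrete actions in the same state.
probs = np.array([[0.8, 0.1, 0.1],   # option 0 prefers action 0
                  [0.1, 0.1, 0.8]])  # option 1 prefers action 2
beta = 0.1                           # weight of the intrinsic term (assumed value)
r_total = 1.0 + beta * mutual_info_bonus(probs, option=0, action=0)  # environment reward plus bonus
```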
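The interest-function mechanism described in (ii) can be pictured as re-weighting the upper-level policy over options. The sketch below is an assumed illustration of that gating, not the exact parameterisation used by DEOC-IF.

```python
import numpy as np

def option_selection_probs(policy_over_options, interests):
    """Re-weight the policy over options by per-option interest values and
    renormalise.  Options whose interest is low in the current state are rarely
    selected, so each option specialises on its own region of the state space."""
    weighted = policy_over_options * interests
    return weighted / weighted.sum()

pi_omega = np.array([0.25, 0.25, 0.25, 0.25])  # upper-level policy over 4 options in state s
interest = np.array([0.9, 0.1, 0.6, 0.05])     # interest function values I_o(s) (assumed)
probs = option_selection_probs(pi_omega, interest)
option = np.random.choice(len(probs), p=probs)  # sample the option to execute
```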
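The decaying clipping parameter of (iii) can be illustrated with a simple schedule plugged into a PPO-style clipped surrogate. The two decay forms (linear and exponential) and their constants are placeholders, since the exact schedules used by OCP are not stated here.

```python
import numpy as np

def clip_epsilon(step, total_steps, eps_start=0.3, eps_end=0.1, mode="linear"):
    """Decaying clipping parameter: a large epsilon early in training permits
    wider policy updates (more exploration), while a small epsilon late in
    training keeps policy updates stable."""
    frac = min(step / total_steps, 1.0)
    if mode == "linear":
        return eps_start + (eps_end - eps_start) * frac
    return eps_end + (eps_start - eps_end) * np.exp(-5.0 * frac)  # exponential decay

def clipped_surrogate(ratio, advantage, eps):
    """PPO-style clipped objective evaluated with the scheduled epsilon."""
    return np.minimum(ratio * advantage, np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)

eps = clip_epsilon(step=2_000, total_steps=100_000)            # early in training -> larger epsilon
loss_term = -clipped_surrogate(ratio=1.4, advantage=0.7, eps=eps)
```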