| In recent years, artificial intelligence (AI) has become an important national development strategy for many countries around the world. In the field of AI, it remains a challenge to create an intelligent system that can autonomously learn to solve decision-making tasks by perceiving the external environment. Reinforcement learning (RL) is considered an effective technology for addressing this challenge. In RL, the agent learns the optimal policy by interacting with the environment. To search for the optimal policy, value-based RL algorithms adopt a greedy policy improvement mechanism during the policy improvement phase. However, value function estimation errors are not considered when executing greedy policy improvement. As a result, the policy performance difference between the learned and optimal policies increases, and the policy improvement process becomes more unstable. In addition, policy-constrained RL algorithms, such as proximal policy optimization (PPO) and trust region policy optimization (TRPO), approximately constrain the difference between the updated policy and the old one in each policy update to ensure the stability of the policy update process. However, it is not known how to combine the idea of constrained policy optimization with policy gradient-based RL algorithms. Considering this, the main purpose of this thesis is to study how to improve the stability and learning speed, and reduce the policy performance difference, of policy-constrained RL in the online and offline settings, respectively. The main work of this thesis includes the following four aspects: (1) To address the conservatism of PPO’s clipping boundary, the authentic boundary proximal policy optimization (ABPPO) is proposed. The effect of PPO’s clipping operation on the objective function of conservative policy iteration and the relationship between PPO’s clipping boundary and TRPO’s trust region boundary are analyzed. Then, a
first-order policy gradient algorithm called ABPPO is proposed, which is based on the authentic boundary setting rule. Moreover, to better keep the difference between the new and old policies within the clipping range, two improved PPO algorithms are proposed: rollback clipping-based ABPPO (RMABPPO) and penalized policy probability difference-based ABPPO (P3DABPPO), which are based on the ideas of rollback clipping and penalized policy probability difference, respectively. (2) To address the problem of accurately estimating the Q-function and to enhance the exploration ability of the agent in off-policy actor-critic (AC) algorithms, the robust actor-critic (RAC) with relative entropy regulating policy improvement is proposed. First, a robust policy improvement mechanism (RPIM) is derived by using the locally optimal policy with respect to the current estimated Q-function to guide policy improvement. By constraining the relative entropy between the new policy and the previous one in policy improvement, RPIM can improve the stability of the policy update process. The theoretical analysis shows that the policy update is endowed with an incentive to increase the policy entropy, which is conducive to enhancing the exploration ability of the agent. Then, RAC is developed by applying the proposed RPIM to regulate the policy improvement. Finally, the developed RAC is proven to be convergent. (3) To improve policy iteration (PI), the dual parallel policy iteration (DPPI) with a coupled policy improvement mechanism is proposed. In contrast to common PI, the developed DPPI considers two parallel policy iterations. At each policy iteration step, the performances of the two parallel policies are evaluated and the better one is defined as the dominant policy. Then, the dominant policy is used to guide the parallel policy improvement in a soft manner. Besides, the theoretical analysis shows that, under certain conditions, the Q-functions of the two new policies obtained in each parallel policy improvement are larger than those of all the
previous dominant policies, which is conducive to accelerating the policy iteration process. Moreover, it is proven that the convergence of DPPI can be guaranteed. Furthermore, the parallel TD3 (PTD3) is proposed by combining DPPI with the twin delayed deep deterministic policy gradient (TD3). (4) To address the problem of distribution shift in offline RL, the generalized offline actor-critic (GOAC) with behavior regularization is proposed. GOAC constrains the skew-symmetric Jensen-Shannon (JS) divergence between the current and behavior policies to alleviate the effect of the distribution shift on the policy update process and to reduce the policy performance difference between the learned and optimal policies. The theoretical analysis shows that, since the skew-symmetric JS divergence is bounded, the policy performance difference of GOAC can be reduced. Besides, an auxiliary network is designed for estimating the skew-symmetric JS divergence between the behavior policy and the current policy. Moreover, the convergence of GOAC is presented. The effectiveness of all proposed RL algorithms is evaluated on continuous-action tasks on the OpenAI Gym and MuJoCo platforms. Experimental results show that all proposed RL algorithms match or exceed state-of-the-art RL algorithms in reward, stability, and learning speed. |
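
For reference, the clipping operation of standard PPO that contribution (1) builds on can be sketched as follows. This is a minimal illustration of the widely used clipped surrogate objective only, not the ABPPO boundary setting rule, which the abstract does not specify. The function name `ppo_clipped_objective`, its parameters, and the default clipping range `eps=0.2` are assumptions chosen for illustration.

```python
import math


def ppo_clipped_objective(new_logp, old_logp, advantages, eps=0.2):
    """Mean clipped surrogate objective of standard PPO (illustrative sketch).

    new_logp, old_logp: log-probabilities of the sampled actions under the
        updated and the old policy (hypothetical inputs).
    advantages: advantage estimates for the same samples.
    eps: clipping range; 0.2 is an assumed default, not the thesis's value.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio between the updated policy and the old policy.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to the interval [1 - eps, 1 + eps].
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        # Pessimistic bound: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a probability ratio of 1.5 and an advantage of 1.0, the sample's contribution is capped at 1.2, so moving the policy further in that direction no longer increases the objective. This is the sense in which the clipping boundary limits the difference between the new and old policies.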