| The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm is a widely used deep reinforcement learning algorithm. Built on top of the Deep Deterministic Policy Gradient (DDPG) algorithm, it is simple to implement, and the deterministic policies it learns perform relatively stably in application scenarios. However, both the model architecture the algorithm adopts and its policy improvement mechanism still have shortcomings. To further improve the performance of TD3 and increase its practical value, this paper makes contributions in the following three aspects.

i. A Dropout-based TD3 algorithm. TD3 improves its policy using deterministic policy gradients, which requires computing the action-value gradient (the standard update is sketched below); an inaccurate action-value gradient can mislead policy improvement, and this gradient-based approach also makes the algorithm prone to falling into local optima. To address these problems, Dropout, a regularization method commonly used when training deep neural networks, is introduced, and a new policy improvement mechanism that does not require the action-value gradient is proposed. Applying it to TD3 yields the Dropout-based TD3 algorithm. Experiments on a variety of robotic control tasks confirm that the method raises the quality of policy improvement and alleviates the tendency to converge to local optima.

ii. A TD3 algorithm based on policy distillation. TD3 introduces an additional set of critics to improve the accuracy of value estimation, but it still uses a single actor, so the agent's exploration is limited and the critics' information is not fully utilised. Introducing additional actors raises the question of how the actors can collaborate efficiently with each other. To address this, a Master-Slave Architecture for Policy Collaboration (MSPC) is proposed and applied to TD3, yielding the TD3 algorithm based on policy distillation. Experiments on a variety of robotic control tasks confirm that the method achieves better sample efficiency and performance.

iii. TD3 algorithms based on group collaboration. The TD3 algorithm based on policy distillation learns with two sets of actors, and its advantage is less pronounced on more complex tasks, so more sets of actors and critics need to be introduced and the group of actors must collaborate efficiently. To this end, a Clipped Ensemble Q-learning (CEQL) mechanism and a Cycle Experience Replay (CER) mechanism are proposed. With these two mechanisms, the TD3 algorithm based on policy distillation is improved and several TD3 algorithms based on group collaboration are proposed. Experiments on a variety of robotic control tasks confirm that these methods achieve higher learning efficiency and better performance. |
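The abstract refers to two standard TD3 components: the twin critics trained with a clipped double-Q target, and the deterministic-policy-gradient actor update whose backward pass computes the action-value gradient dQ/da. The following is a minimal sketch of those two standard components only, assuming PyTorch and illustrative network sizes; it is not the thesis's code, and the proposed Dropout mechanism, MSPC, CEQL, and CER are not shown because the abstract does not give their details.

import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy pi(s) with bounded actions."""
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                                 nn.Linear(256, action_dim), nn.Tanh())
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class TwinCritic(nn.Module):
    """Two independent Q-networks, as in standard TD3."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.q1 = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
                                nn.Linear(256, 1))
        self.q2 = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
                                nn.Linear(256, 1))

    def forward(self, state, action):
        sa = torch.cat([state, action], dim=-1)
        return self.q1(sa), self.q2(sa)

def clipped_double_q_target(reward, not_done, next_state, actor_tgt, critic_tgt,
                            gamma=0.99, policy_noise=0.2, noise_clip=0.5):
    """Standard TD3 target: y = r + gamma * min(Q1', Q2')(s', pi'(s') + clipped noise)."""
    with torch.no_grad():
        noise = (torch.randn_like(actor_tgt(next_state)) * policy_noise
                 ).clamp(-noise_clip, noise_clip)
        next_action = (actor_tgt(next_state) + noise).clamp(-actor_tgt.max_action,
                                                            actor_tgt.max_action)
        q1, q2 = critic_tgt(next_state, next_action)
        return reward + not_done * gamma * torch.min(q1, q2)

def dpg_actor_loss(state, actor, critic):
    """Deterministic policy gradient: maximize Q1(s, pi(s)).

    Backpropagating this loss differentiates the critic with respect to the
    action, i.e. it computes the action-value gradient dQ/da that the abstract
    identifies as a potential source of misleading policy updates.
    """
    q1, _ = critic(state, actor(state))
    return -q1.mean()

Minimizing dpg_actor_loss with an optimizer over the actor's parameters reproduces the gradient-based policy improvement discussed in point i; the Dropout-based mechanism proposed in the thesis is described as replacing this action-value-gradient computation, but its details are beyond what the abstract provides.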