| Deep reinforcement learning methods often use neural networks to estimate the value of environment states and agent actions, and derive the agent’s policy for interacting with the environment from these value estimates. Overestimation or underestimation can mislead policy improvement and accumulate errors through the temporal-difference update, leading to unstable training or convergence to local optima, and ultimately degrading the agent’s final performance. Accurate value estimation is therefore crucial for policy improvement and is one of the key factors determining the sample efficiency of deep reinforcement learning. Current methods for improving the accuracy of value estimates usually apply a correction to the estimates given by two or more function estimators, but this correction usually cannot adapt dynamically to specific training samples or training stages, and thus offers limited control over the errors produced. In fact, an appropriate degree of overestimation or underestimation for specific training stages and experiences can help the algorithm explore the environment optimistically and keep the policy out of high-risk areas. To address these issues, this paper proposes a novel adaptive value estimation method that uses familiarity and uncertainty to adjust the value estimate, together with a prioritized experience replay method based on familiarity. The main research work is summarized as follows: 1) This paper proposes the concept of familiarity, which measures how familiar a function estimator is with each experience by counting how frequently that experience is sampled from the replay buffer during training. A detailed formula for calculating familiarity is given, and the change in the familiarity of experiences during training is analyzed and demonstrated theoretically and further validated by simulation experiments. 2) This paper presents an adaptive value estimation method based on 
familiarity and uncertainty. It uses an ensemble Q-learning approach in which multiple function estimators estimate values simultaneously. The target Q-value is computed from the average of these estimates and is adjusted dynamically, using the variance of the estimates and the familiarity of the experiences as penalty terms, to balance overestimation and underestimation. It is shown that for any experience there is always a familiarity value that makes the bias of the function estimator approximately zero. 3) This paper proposes a prioritized experience replay method based on familiarity, which allows more novel and more valuable experiences to be prioritized for replay, thereby improving the sample efficiency of the algorithm. In addition, a dual-actor learning framework is used to enhance the exploration ability of deep reinforcement learning algorithms, allowing the policy to take actions with higher Q-values. 4) This paper combines the proposed method with several state-of-the-art deep reinforcement learning algorithms and evaluates each combination under different update-to-data (UTD) ratios. Across several continuous-action tasks, the proposed method achieves higher sample efficiency than comparable methods on every task, demonstrating the advantage of the proposed algorithms. In addition, ablation experiments verify the effectiveness of the individual components of the proposed algorithms. This paper applies the proposed concept of familiarity to both value estimation and prioritized experience replay in deep reinforcement learning; both applications improve sample efficiency and allow agents to achieve higher performance, which benefits the practical application of deep reinforcement learning methods. |
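
The abstract does not give the familiarity formula or the exact form of the penalized target, so the following is only a minimal sketch of how the three ingredients could fit together. Everything specific in it is an assumption made for illustration: the saturating map `count / (count + scale)`, the penalty `beta * familiarity * variance`, the direction in which familiarity scales the penalty, and all function and parameter names are hypothetical and are not taken from the paper.

```python
import statistics


def familiarity(sample_count, scale=10.0):
    """Map a replay sample count to [0, 1).

    Hypothetical form chosen for illustration; the paper's formula may differ.
    """
    return sample_count / (sample_count + scale)


def adaptive_target(reward, next_q_values, fam, gamma=0.99, beta=0.5, done=False):
    """Ensemble-mean target with a variance penalty weighted by familiarity.

    Assumed form: y = r + gamma * (mean(Q) - beta * fam * var(Q)).
    """
    mean_q = statistics.fmean(next_q_values)      # average of the ensemble estimates
    var_q = statistics.pvariance(next_q_values)   # disagreement between estimators
    penalised = mean_q - beta * fam * var_q
    return reward + (0.0 if done else gamma * penalised)


def replay_priorities(sample_counts, eps=1e-3):
    """Normalised sampling probabilities that favour less familiar experiences.

    Assumed rule: priority is proportional to (1 - familiarity) + eps.
    """
    raw = [1.0 - familiarity(c) + eps for c in sample_counts]
    total = sum(raw)
    return [p / total for p in raw]
```

Under these assumptions, an experience that has rarely been sampled receives a nearly unpenalized (more optimistic) target and a higher replay probability, while a frequently sampled experience is penalized more strongly when the ensemble disagrees. Whether the paper scales the penalty in this direction is not stated in the abstract.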