
Research On Optimization Method Of Deep Reinforcement Learning Experience Replay

Posted on: 2022-05-12    Degree: Master    Type: Thesis
Country: China    Candidate: P F Liu    Full Text: PDF
GTID: 2518306533472224    Subject: Information and Communication Engineering
Abstract/Summary:
With the rapid development of deep reinforcement learning, its theoretical results have steadily improved and have been applied successfully in many fields. However, in complex, high-dimensional environments a deep reinforcement learning agent needs a great deal of training time and uses its samples inefficiently, which leads to low learning efficiency. The experience replay mechanism improves sample utilization by reusing historical experiences, yet existing experience replay algorithms have three deficiencies: uniform sampling that ignores sample importance, an inability to suppress outlier data, and neglect of state information. To improve sample utilization, this thesis optimizes the experience replay method in deep reinforcement learning. The main research content consists of the following three parts:

(1) To address the problem that traditional hindsight experience replay cannot distinguish the importance of samples, a priority-based hindsight experience replay is proposed. First, a trajectory distance function is defined to describe the average distance between the achieved goal and the target task within an episode. Second, the priority value of each trajectory is computed from this average distance and converted into a sampling probability. Finally, to alleviate reward overestimation, the reward function of the hindsight experience samples is redefined according to the distance between the achieved goal and the target task.

(2) To address the problem that prioritized experience replay with a mean-squared-error loss cannot suppress outlier data, a prioritized experience replay based on adjusting the TD-error loss function is proposed. First, the Huber loss function replaces the mean-squared-error loss. Second, to avoid the bias introduced by the mean-squared error and the priority partition, the priority of every sample whose TD-error is below a threshold is clipped to that threshold, so these samples are sampled uniformly among themselves; samples whose TD-error exceeds the threshold use the TD-error itself as their priority. This method avoids the bias caused by non-uniform sampling.

(3) To address the problem that state information is not used in experience replay, an experience replay based on state information entropy is proposed. First, the states of the transition samples generated by the interaction between the agent and the environment are normalized, the probability of each normalized state variable is estimated, and the state information entropy is computed. Second, each transition sample is placed in a sample queue together with its state information entropy, and the first quartile of the sorted queue is extracted. Finally, two experience replay units store different samples to ensure sample diversity.

The three methods above are verified experimentally on Gym, Atari 2600 games, and MuJoCo control tasks. The results show that, compared with other classic reinforcement learning algorithms, the proposed experience replay optimization methods give the agent better performance and improve its learning efficiency. The thesis contains 24 figures, 20 tables, and 81 references.
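To make part (1) concrete, here is a minimal Python sketch of how a trajectory-distance priority and a distance-based hindsight reward could be computed; the exponent alpha, the Euclidean distance metric, and the success threshold are illustrative assumptions, not the thesis's exact definitions.

```python
import numpy as np

def trajectory_priority(achieved_goals, desired_goal, alpha=0.6, eps=1e-6):
    """Priority of one episode from the average achieved-goal-to-target distance."""
    achieved_goals = np.asarray(achieved_goals, dtype=np.float64)   # shape (T, goal_dim)
    desired_goal = np.asarray(desired_goal, dtype=np.float64)       # shape (goal_dim,)
    # Trajectory distance function: mean Euclidean distance over the episode
    avg_dist = np.linalg.norm(achieved_goals - desired_goal, axis=1).mean()
    # Trajectories that stayed closer to the target task get higher priority
    return (1.0 / (avg_dist + eps)) ** alpha

def sampling_probabilities(priorities):
    """Normalize per-trajectory priorities into sampling probabilities."""
    p = np.asarray(priorities, dtype=np.float64)
    return p / p.sum()

def reshaped_reward(achieved_goal, desired_goal, threshold=0.05):
    """Distance-based reward for hindsight samples, replacing the sparse 0/-1 reward."""
    d = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal))
    return 0.0 if d < threshold else -float(d)
```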
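Part (2) combines two ideas: the Huber loss bounds the contribution of outlier TD-errors, and a priority threshold keeps small-TD-error samples uniformly sampled. A hedged sketch follows, assuming a threshold of 1 and priorities proportional to the clipped absolute TD-error.

```python
import numpy as np

def huber_loss(td_error, delta=1.0):
    """Huber loss of the TD-error: quadratic near zero, linear for outliers."""
    a = np.abs(td_error)
    return np.where(a <= delta, 0.5 * td_error ** 2, delta * (a - 0.5 * delta))

def clipped_priority(td_errors, threshold=1.0, eps=1e-6):
    """Samples with |TD-error| below the threshold all receive the threshold as
    their priority (so they are sampled uniformly among themselves); samples
    above the threshold keep |TD-error| itself as their priority."""
    abs_td = np.abs(np.asarray(td_errors, dtype=np.float64))
    return np.where(abs_td < threshold, threshold, abs_td) + eps
```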
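For part (3), the sketch below estimates the state information entropy with a histogram over the normalized state vector and splits transitions into two replay units; the queue length, keeping the highest-entropy quartile, and the half-and-half batch split are assumptions made only to keep the example self-contained.

```python
import numpy as np
from collections import deque

def state_entropy(state, bins=10, eps=1e-12):
    """Shannon entropy of a normalized state vector (histogram estimate)."""
    s = np.asarray(state, dtype=np.float64)
    s = (s - s.min()) / (s.max() - s.min() + eps)        # normalize to [0, 1]
    counts, _ = np.histogram(s, bins=bins, range=(0.0, 1.0))
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

class TwoUnitReplay:
    """Two experience replay units: one for the first-quartile (high-entropy)
    transitions, one for the rest; batches draw from both for diversity."""

    def __init__(self, capacity, queue_len=1000):
        self.queue = deque(maxlen=queue_len)             # (entropy, transition) pairs
        self.unit_a = deque(maxlen=capacity)             # first-quartile samples
        self.unit_b = deque(maxlen=capacity)             # remaining samples

    def add(self, transition, state):
        self.queue.append((state_entropy(state), transition))
        if len(self.queue) == self.queue.maxlen:         # queue full: sort and split
            ranked = sorted(self.queue, key=lambda x: x[0], reverse=True)
            cut = len(ranked) // 4
            self.unit_a.extend(t for _, t in ranked[:cut])
            self.unit_b.extend(t for _, t in ranked[cut:])
            self.queue.clear()

    def sample(self, batch_size):
        # Draw half of each batch from each unit to keep batches diverse
        half = batch_size // 2
        idx_a = np.random.randint(len(self.unit_a), size=half)
        idx_b = np.random.randint(len(self.unit_b), size=batch_size - half)
        return [self.unit_a[i] for i in idx_a] + [self.unit_b[i] for i in idx_b]
```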
Keywords/Search Tags: experience replay, priority, Huber loss, state information entropy