Offline deep reinforcement learning combines traditional deep reinforcement learning with offline learning and is one of the research hotspots in machine learning. An offline algorithm learns from a fixed dataset collected from previous task interactions, a property of high practical value in fields such as robotics and autonomous driving. Because the offline dataset usually cannot cover all state-action pairs, offline algorithms inevitably suffer from overestimation of action values, model bias, and unstable performance. To address these problems, this thesis contributes work in the following three aspects:

i. Episode-classified experience replay. In reinforcement learning, the episodic cumulative return is a complete evaluation of a sequence of actions taken by the agent. Traditional experience replay does not take the episodic return into account during network training, while prioritized experience replay reduces training efficiency because the priorities of experience samples must be updated at every stage of training. To address these problems, a deep deterministic policy gradient (DDPG) algorithm based on episode-classified experience replay is proposed: at storage time, experience samples are classified according to the cumulative return of the episode they belong to. Experiments show that, by making efficient use of past successful experience, the algorithm performs well on a variety of continuous control tasks.

ii. Meta-learning based initialization. Offline algorithms suffer from an incomplete training-data distribution: state-action pairs that are needed during training are missing or visited too rarely, which leads to unstable training results and makes the algorithm dependent on the distribution of the offline dataset. To address this problem, an offline deep reinforcement learning method based on meta-learning is proposed. A meta-learned set of initial network parameters improves the adaptability and learning ability of the network and alleviates the bias of the policy network model, so that the algorithm learns stably from a variety of datasets. Experimental results on continuous control tasks show that the algorithm is more robust.

iii. Classified replay of historical actions. The mainstream way to optimize offline algorithms is to constrain action selection through the network model, thereby controlling the distance between the behavior policy distribution and the target policy distribution; controlling the error generated in this way is also known as controlling extrapolation error. Inspired by this, and starting instead from the sampling process over the offline dataset, an offline deep reinforcement learning method with classified replay of historical actions is proposed. The method improves performance by improving the traditional experience replay used in offline algorithms: the offline dataset is divided into two parts, a historical-action priority dataset and the original dataset. The training process balances exploration and exploitation, suppresses extrapolation error from the perspective of experience replay, and compensates for the randomness and blindness with which offline deep reinforcement learning algorithms otherwise select experience. The algorithm achieves comparable training results and provides a new idea for optimizing offline algorithms.
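To make the classification-based replay in contributions i and iii concrete, the following is a minimal sketch of an episode-classified replay buffer; the class, parameter, and threshold names are hypothetical illustrations rather than the thesis's actual implementation. Completed episodes are filed into a high-return pool or an ordinary pool according to their cumulative return, and each training batch mixes samples from both pools.

```python
# Hypothetical sketch of episode-classified experience replay (not the thesis's code).
import random
from collections import deque

class EpisodeClassifiedReplayBuffer:
    """Collects transitions per episode; when an episode ends, all of its
    transitions are filed into a high-return or ordinary pool by cumulative return."""

    def __init__(self, capacity=100_000, return_threshold=0.0, high_ratio=0.5):
        self.high = deque(maxlen=capacity)       # transitions from high-return episodes
        self.ordinary = deque(maxlen=capacity)   # transitions from the remaining episodes
        self.current_episode = []
        self.return_threshold = return_threshold # assumed classification threshold
        self.high_ratio = high_ratio             # fraction of a batch drawn from the high-return pool

    def store(self, state, action, reward, next_state, done):
        self.current_episode.append((state, action, reward, next_state, done))
        if done:
            episode_return = sum(t[2] for t in self.current_episode)
            pool = self.high if episode_return >= self.return_threshold else self.ordinary
            pool.extend(self.current_episode)
            self.current_episode = []

    def sample(self, batch_size):
        n_high = min(int(batch_size * self.high_ratio), len(self.high))
        n_ord = min(batch_size - n_high, len(self.ordinary))
        batch = random.sample(list(self.high), n_high) + random.sample(list(self.ordinary), n_ord)
        random.shuffle(batch)
        return batch
```

Because the classification happens once, at storage time, no per-sample priorities have to be recomputed during training, which is the efficiency argument made for contribution i.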
Building on offline reinforcement learning, the three lines of work above attack problems such as overestimation of action values and model bias from different angles, and all achieve good experimental results.
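The abstract does not specify which meta-learning procedure builds the initial network parameters in contribution ii. As one possible illustration, the Reptile-style sketch below adapts a copy of the policy to each offline dataset with a few gradient steps and then moves the shared initialization toward the adapted parameters; the policy network, the dataset interface (sample_batch), and the behavior-cloning inner loss are assumptions made for the example.

```python
# Reptile-style sketch of meta-learning an initialization from several offline
# datasets; interfaces and losses are assumptions, not the thesis's method.
import copy
import torch
import torch.nn as nn

def inner_update(policy, dataset, steps=10, lr=1e-3):
    """Adapt a copy of the policy to one offline dataset with a few gradient steps."""
    adapted = copy.deepcopy(policy)
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        states, actions = dataset.sample_batch()                  # hypothetical dataset interface
        loss = nn.functional.mse_loss(adapted(states), actions)   # e.g. a behavior-cloning loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return adapted

def meta_initialize(policy, datasets, meta_steps=100, meta_lr=0.1):
    """Move the shared initialization toward the parameters adapted on each dataset."""
    for _ in range(meta_steps):
        for dataset in datasets:
            adapted = inner_update(policy, dataset)
            with torch.no_grad():
                for p, p_adapted in zip(policy.parameters(), adapted.parameters()):
                    p.add_(meta_lr * (p_adapted - p))             # Reptile outer update
    return policy
```

The intended effect, as described in contribution ii, is that starting offline training from such an initialization makes the learned policy less sensitive to the particular distribution of any single offline dataset.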