The progress of representation learning motivates researchers to pursue not only abstract features of data but also the ability of machines to perform decision-making tasks. Deep reinforcement learning is a decision-learning method that integrates deep learning and reinforcement learning. This thesis focuses on the research and application of deep reinforcement learning policy models for continuous and discrete action-space environments. In recent years, the TD3 algorithm has been one of the most popular decision models for continuous action spaces and has received wide attention; it can be used effectively for decision tasks in such environments. However, TD3 still needs improvement in training stability, convergence speed, and the efficiency with which it uses experience data. In addition, the PPO algorithm has been one of the algorithms researchers pay close attention to for discrete action spaces, but PPO cannot effectively use historical information to guide current decisions in partially observable environments. To address these problems, this thesis studies the characteristics of both algorithms and proposes corresponding improved algorithms. The main research work is summarized as follows:

(1) This thesis proposes the DP-TD3 algorithm to address the model instability and data underutilization of TD3. DP-TD3 contains two innovations. First, it adopts a dueling architecture for the critic network, decomposing the critic's Q-value estimate into an advantage function and a state-value function; this yields more stable Q-value estimation and enhances the expressiveness and generalization ability of the critic network. Second, it introduces a prioritized experience replay mechanism that uses the loss value of each transition as its sampling priority, improving the sampling rate and learning efficiency of high-value data. This thesis compares the DP-TD3, TD3, and DDPG algorithms in four MuJoCo environments with different dynamics and reward settings, in terms of average return and convergence speed. The experimental results show that DP-TD3 achieves the highest average return in three of the four environments and has the fastest convergence speed. This indicates that DP-TD3 can effectively mitigate the overestimation, data underutilization, and weak generalization present in TD3 and DDPG, and improve the performance of deep reinforcement learning on continuous action-space robot control tasks.

(2) This thesis proposes the LSTM-PPO model to address PPO's insufficient use of historical information in partially observable game environments. LSTM-PPO is a dual-channel deep reinforcement learning model that processes environment images and clue information separately and uses LSTM units to capture temporal information. The thesis analyzes the characteristics of the LSTM-PPO model, describes the steps of the algorithm in detail, and compares it with common DRL baseline methods in four Atari games, verifying that the proposed model can effectively capture temporal information and improve decision performance. Finally, LSTM-PPO is applied to the popular real-time strategy game "StarCraft II", where it achieves a high win rate in the experiments. This indicates that LSTM-PPO effectively solves PPO's insufficient use of historical information in partially observable game environments.

Overall, this thesis studies and applies deep reinforcement learning policy models for continuous and discrete action-space environments, proposing the DP-TD3 and LSTM-PPO algorithms, which respectively address the model stability and data-utilization problems of TD3 and the temporal-information-utilization problem of PPO. The experimental results show that the proposed algorithms outperform baseline methods in multiple environments and achieve a high win rate in a popular real-time strategy game. This work has important value for improving the decision-making ability of deep reinforcement learning in complex environments.
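The prioritized experience replay mechanism described in contribution (1) can be sketched as follows. This is a minimal proportional-priority buffer, not the thesis implementation: the class name, the `alpha` exponent, and the `eps` floor are illustrative assumptions; the only idea taken from the abstract is that each transition's loss value serves as its sampling priority.

```python
import random

class PrioritizedReplayBuffer:
    """Replay buffer that samples transitions with probability proportional
    to their stored priority, where the priority is derived from the last
    observed loss value for that transition."""

    def __init__(self, capacity, alpha=0.6, eps=1e-5):
        self.capacity = capacity
        self.alpha = alpha    # how strongly the loss shapes sampling
        self.eps = eps        # floor so every transition stays sampleable
        self.data = []
        self.priorities = []
        self.pos = 0          # ring-buffer write position

    def add(self, transition, loss=1.0):
        p = (abs(loss) + self.eps) ** self.alpha
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(p)
        else:
            # Overwrite the oldest transition once capacity is reached.
            self.data[self.pos] = transition
            self.priorities[self.pos] = p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        total = sum(self.priorities)
        probs = [p / total for p in self.priorities]
        idx = random.choices(range(len(self.data)), weights=probs, k=batch_size)
        return idx, [self.data[i] for i in idx]

    def update_priorities(self, idx, losses):
        # After a training step, refresh priorities with the new loss values.
        for i, loss in zip(idx, losses):
            self.priorities[i] = (abs(loss) + self.eps) ** self.alpha
```

In a TD3-style training loop, the agent would call `sample` each gradient step and `update_priorities` with the fresh critic losses, so high-loss ("high-value") transitions are revisited more often.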
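The dual-channel design in contribution (2) can be sketched with a single NumPy LSTM cell. This is an illustrative toy, not the thesis model: the real LSTM-PPO would encode raw images with a CNN, whereas here both channel features are assumed to be precomputed vectors, and all dimensions, names, and the random initialization are assumptions. It only shows the mechanism the abstract describes: the two channels are fused and an LSTM hidden state carries history across partially observable timesteps.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Minimal NumPy LSTM cell; one weight matrix holds all four gates."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Rows: input, forget, cell-candidate, and output gates stacked.
        self.W = rng.normal(0.0, 0.1, (4 * hidden_dim, input_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        self.hidden_dim = hidden_dim

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        H = self.hidden_dim
        i = sigmoid(z[:H])          # input gate
        f = sigmoid(z[H:2 * H])     # forget gate
        g = np.tanh(z[2 * H:3 * H]) # candidate cell state
        o = sigmoid(z[3 * H:])      # output gate
        c_new = f * c + i * g
        h_new = o * np.tanh(c_new)
        return h_new, c_new

def fuse_channels(image_feat, clue_feat):
    # Dual-channel fusion: image and clue features are encoded separately
    # and concatenated before entering the recurrent core.
    return np.concatenate([image_feat, clue_feat])
```

In the full model, the hidden state `h` returned at each step would feed PPO's policy and value heads, so decisions at time t depend on observations from earlier timesteps rather than on the current frame alone.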