Reinforcement learning offers trial-and-error learning, no need for annotated data, a sequential decision-making process, and an autonomous learning architecture, making it suitable for a wide range of practical applications. However, traditional reinforcement learning scales poorly and often fails to achieve satisfactory results when the environment is complex or the task is difficult. In recent years, with the rapid development of deep learning, its powerful function approximation and feature representation capabilities have enabled breakthroughs in many fields. Similarly, deep reinforcement learning, which combines deep learning with reinforcement learning, has greatly advanced reinforcement learning, making it effective at solving complex practical problems in high-dimensional spaces, and has become one of the most popular topics in artificial intelligence. Unfortunately, due to neural network fitting errors and poor algorithmic stability, existing deep reinforcement learning algorithms still have limitations in continuous action spaces and large discrete action spaces, which hinders their wider application. This dissertation studies deep reinforcement learning in continuous action spaces and large discrete action spaces, relying on two specific application scenarios, simulated robots and recommendation systems, and focuses on three key issues: low exploration efficiency, sparse rewards, and value estimation bias. The main contributions of this dissertation are summarized as follows:

(1) To address the low sample exploration efficiency of reinforcement learning algorithms in continuous control environments, this dissertation proposes a new Actor-Critic algorithm, WPVOP, based on Weakly Pessimistic Value estimation and Optimistic Policy optimization. WPVOP introduces improvements on both the Critic and the Actor sides. On the Critic side, a weakly pessimistic value estimation method compensates for the pessimistic lower bound of the Q-value to stimulate exploration in low-Q-value regions, preventing the algorithm from getting trapped in local optima; at the same time, it effectively avoids Q-value overestimation and keeps the algorithm stable. On the Actor side, an optimistic policy update method moves the policy toward high-Q-value regions, thereby accelerating policy iteration. Experimental results show that, compared with existing deep reinforcement learning algorithms, WPVOP achieves higher exploration efficiency and better performance in the MuJoCo continuous control environments.
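The abstract does not give WPVOP's exact update rules. As a rough illustration of the underlying idea only, the sketch below assumes a TD3-style pair of critics and shows one plausible form of a weakly pessimistic critic target (the clipped double-Q lower bound plus a compensation term weighted by a hypothetical coefficient beta) together with an optimistic actor loss; the function names, the coefficient, and the exact formulas are assumptions for illustration, not the dissertation's definitions.

```python
import torch

def weakly_pessimistic_target(q1_next, q2_next, reward, done, gamma=0.99, beta=0.5):
    """Illustrative (assumed) form of a weakly pessimistic critic target.

    Starts from the pessimistic lower bound min(Q1', Q2') of clipped double
    Q-learning and adds back a fraction `beta` of the gap between the two
    critics, so low-Q regions are penalized less strongly and exploration
    there is not discouraged. `beta` and this exact form are assumptions.
    """
    lower = torch.min(q1_next, q2_next)      # pessimistic lower bound
    gap = torch.abs(q1_next - q2_next)       # disagreement between critics
    compensated = lower + beta * gap         # weakly pessimistic estimate
    return reward + gamma * (1.0 - done) * compensated


def optimistic_actor_loss(critic, states, actions):
    """Optimistic policy-update sketch: ascend the larger of the two critic
    heads so the policy is pulled toward high-Q regions (an assumed,
    simplified form; the critic is assumed to return two Q heads)."""
    q1, q2 = critic(states, actions)
    return -torch.max(q1, q2).mean()
```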
(2) To address the problem of sparse user interaction information in recommendation systems, this dissertation proposes a new REINFORCE-based relative-distance re-ranking algorithm called ReinRank (Relative Ranking information). Specifically, this dissertation uses the relative ranking information of the re-ranked items to construct an implicit reward based on the REINFORCE algorithm and proposes a new re-ranking loss function, ReinRank. ReinRank effectively alleviates the model optimization bias caused by sparse rewards and provides a new perspective for future work on re-ranking optimization. In addition, this dissertation proposes a simple and efficient re-ranking architecture, NRCF (Neural Reranking-based Collaborative Filtering), which uses the initial top-k recommended items to obtain users' implicit preferences for re-ranking. Experimental results show that NRCF and ReinRank can be applied to different collaborative filtering models and effectively improve their re-ranking performance.

(3) To address the low sample exploration efficiency of reinforcement learning-based recommendation systems, this dissertation proposes a new Hybrid Optimistic Random Q-ensemble algorithm, HORQ. HORQ introduces a hybrid global-local Q-network that preserves both local and global features of user interaction information, enabling efficient representation of user interest. In addition, HORQ theoretically characterizes the Q-value underestimation in reinforcement learning algorithms caused by the uncertainty of buried positive feedback (arising from sparse user interaction information in recommendation systems) and introduces an optimistic random ensemble Q-learning method to alleviate this underestimation bias. Experimental results show that, compared with existing deep reinforcement learning-based recommendation algorithms, HORQ significantly improves the exploration efficiency and performance of reinforcement learning algorithms in recommendation systems.

(4) To address the Q-value estimation bias caused by the gradually diminishing available action space in many recommendation systems, a new Q-learning-based Action Diminishing Error Reduction algorithm, Q-ADER, is proposed. This dissertation first analyzes the problems a diminishing action space causes for reinforcement learning algorithms and theoretically proves that it introduces value estimation errors into standard TD updates and leads to suboptimal policies in the vanilla DQN algorithm. To alleviate this problem, this dissertation proposes the Q-ADER algorithm to reduce the error caused by the diminishing action space. Experimental results show that Q-ADER effectively mitigates the Q-value estimation errors caused by the diminishing action space and significantly improves the performance of reinforcement learning recommendation algorithms. In addition, the experiments reveal that Q-ADER better differentiates users' preferences for different items due to its higher Q-value variance, thus producing better recommendation strategies.
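Q-ADER's correction itself is not spelled out in the abstract. The following minimal sketch, with purely hypothetical numbers, only illustrates where the bias described above enters: once items already shown in a session are removed from the candidate set, the bootstrapped max in the standard DQN target is taken over a shrinking action set, so the two targets diverge; reducing that gap is what Q-ADER is stated to do.

```python
import numpy as np

def dqn_target_full(q_next, reward, gamma=0.99):
    """Standard TD(0)/DQN target: bootstrap with the max over the FULL item
    space, as if every item could still be recommended at the next step."""
    return reward + gamma * np.max(q_next)

def dqn_target_diminishing(q_next, reward, shown_items, gamma=0.99):
    """Target under a diminishing action space: items already recommended in
    the session are masked out before taking the max. The gap between this
    target and the full-space target is the value estimation error attributed
    to the diminishing action space (Q-ADER's exact correction is not
    reproduced here)."""
    q = q_next.copy()
    q[list(shown_items)] = -np.inf          # already-shown items are unavailable
    return reward + gamma * np.max(q)

# Hypothetical numbers purely for illustration.
q_next = np.array([2.0, 1.5, 3.0, 0.5])     # Q(s', a) for 4 candidate items
shown = {2}                                  # item 2 was already recommended
print(dqn_target_full(q_next, reward=1.0))          # 1 + 0.99 * 3.0 = 3.97
print(dqn_target_diminishing(q_next, 1.0, shown))   # 1 + 0.99 * 2.0 = 2.98
```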