
Research On Deep Reinforcement Learning Algorithm For Continuous Action Control

Posted on: 2024-03-07    Degree: Doctor    Type: Dissertation
Country: China    Candidate: M Li
GTID: 1528307301477214    Subject: Computer Science and Technology
Abstract/Summary:
As a crucial way to achieve artificial intelligence, deep reinforcement learning (DRL) combines the powerful perception ability of deep learning with the excellent decision-making ability of reinforcement learning, and it has produced fruitful results on typical real-world tasks. Among these, DRL research on continuous action control, such as robot control and intelligent driving, is in the ascendant. By optimizing the control policy, DRL can effectively achieve optimal continuous action control, and it has therefore attracted significant attention from both industry and academia. Although related research is in full swing, several issues in existing DRL algorithms for continuous action control require further study. This dissertation focuses on four core issues: reaching the exploration-exploitation trade-off, achieving sufficient exploration, improving exploitation efficiency, and dealing with insufficient state observation. The research content of this dissertation is as follows.

An adaptive exploration policy (AEP) is proposed to reach the exploration-exploitation trade-off. Most existing DRL algorithms for continuous action control construct an exploration policy by adding noise to a deterministic policy. Because the added noise is sampled from a fixed distribution, the adaptability of exploration is low, and an exploration scale that is too large or too small leads to an imbalance between exploration and exploitation. AEP solves this problem by adjusting the exploration scale according to the training stability: it increases the noise scale when training is stable and reduces it when training is unstable. Theoretical analysis and experiments show that DRL algorithms based on AEP can effectively reach the exploration-exploitation trade-off.

An exploration network policy (ENP) is proposed to achieve sufficient exploration. Noise sampled from a random distribution cannot guarantee that all important environmental information is explored, which may lead to insufficient exploration. ENP addresses this problem by guiding the agent to explore in the direction that increases sample diversity, thereby avoiding the local optima caused by insufficient exploration. Specifically, it trains an exploration network to generate exploration directions that increase sample diversity, and it adjusts the exploration scale in the same way as AEP. Theoretical analysis and experimental results indicate that the DRL algorithm based on ENP can effectively achieve sufficient exploration.

A clustering experience replay (CER) is proposed to improve exploitation efficiency. Most existing DRL algorithms for continuous action control exploit environmental information through experience replay, which replays the samples collected during agent-environment interaction by uniform sampling. Uniform sampling cannot guarantee that all kinds of samples are replayed sufficiently, so the agent's capture of environmental information may be incomplete, leading to low exploitation efficiency. CER solves this problem by mining the environmental information in all kinds of samples according to the similarity between them: it clusters samples in a time-based divide-and-conquer framework to divide them into different kinds at minimal cost, and it constructs a conditional probability density function to ensure that each kind of sample is replayed sufficiently. Theoretical analysis and experiments show that CER effectively improves the exploitation efficiency of DRL algorithms compared with existing algorithms.
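As an illustration only, the following Python sketch shows one way the two-stage replay idea behind CER could look in code. The cluster construction used here (fixed-length time segments) and the uniform cluster-then-sample rule are simplifying assumptions; the dissertation's CER clusters by sample similarity within a time-based divide-and-conquer framework and replays through a conditional probability density function.

```python
import random

class ClusteredReplayBuffer:
    """Illustrative sketch of two-stage sampling from clustered experience.

    Assumption: transitions are grouped into fixed-length time segments that
    stand in for clusters, and both the cluster and the transition inside it
    are drawn uniformly. This is not the dissertation's exact CER algorithm.
    """

    def __init__(self, segment_size=1000):
        self.segment_size = segment_size
        self.segments = [[]]  # each inner list stands in for one cluster

    def add(self, transition):
        # Open a new time segment once the current one is full.
        if len(self.segments[-1]) >= self.segment_size:
            self.segments.append([])
        self.segments[-1].append(transition)

    def sample(self, batch_size):
        # Two-stage sampling: pick a cluster first, then a transition inside
        # it, so every kind of sample gets replayed regardless of its size.
        nonempty = [s for s in self.segments if s]
        return [random.choice(random.choice(nonempty)) for _ in range(batch_size)]
```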
A multi-view decision process (MvDP) is proposed to deal with insufficient state observation. Existing DRL algorithms generally model the task as a Markov decision process, in which the effect of an action depends only on the current state. This assumption is reasonable only if the state is correctly defined and sufficiently observed, so existing DRL algorithms are not suitable for the case of insufficient state observation. To solve this problem, MvDP considers each explored sample from the views of history, present, and future, and uses historical information to compensate for the lack of state information. Based on MvDP, a multi-view DRL algorithm is proposed. Theoretical analysis and experimental results prove that the new algorithm is effective under insufficient state observation.

In summary, this dissertation focuses on four core issues in DRL algorithms for continuous action control and proposes corresponding effective solutions: AEP addresses the imbalance between exploration and exploitation, ENP solves insufficient exploration, CER deals with low exploitation efficiency, and MvDP handles insufficient state observation. The research results of this dissertation provide strong theoretical and algorithmic support for the application of DRL algorithms to continuous action control.
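As an illustration of how historical information can compensate for insufficient state observation, the following minimal sketch stacks recent observations into an augmented state before it is passed to the policy. The fixed stack length and simple concatenation are assumptions for illustration; the dissertation's MvDP additionally incorporates the present and future views of each sample.

```python
from collections import deque
import numpy as np

class HistoryAugmentedObservation:
    """Minimal sketch of the history view: keep a fixed-length window of
    past observations and feed their concatenation to the policy.

    Assumption: observations are flat vectors and history_len = 4; the
    dissertation's MvDP is more general than this stacking scheme.
    """

    def __init__(self, history_len=4):
        self.history_len = history_len
        self.buffer = deque(maxlen=history_len)

    def reset(self, first_obs):
        # Fill the history with the initial observation of an episode.
        self.buffer.clear()
        for _ in range(self.history_len):
            self.buffer.append(np.asarray(first_obs, dtype=np.float32))
        return self._stacked()

    def step(self, obs):
        # Append the newest observation and return the augmented state.
        self.buffer.append(np.asarray(obs, dtype=np.float32))
        return self._stacked()

    def _stacked(self):
        # Concatenate past observations so the policy conditions on history.
        return np.concatenate(list(self.buffer), axis=-1)
```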
Keywords/Search Tags:Deep Reinforcement Learning, Continuous Action Control, Exploration and Exploitation, Clustering Experience Replay, Multi-view Decision Process