Research On Proximal Parameter-based Policy Optimization Algorithm

Posted on: 2023-03-20
Degree: Master
Type: Thesis
Country: China
Candidate: J X Yang
Full Text: PDF
GTID: 2568307058963839
Subject: Control engineering
Abstract/Summary:
Reinforcement learning is an important branch of machine learning that studies how an agent can make better decisions based on the current state of its environment. It is one of the most promising research directions for achieving the goals of artificial intelligence, and a research hotspot among intelligent-system developers. Within reinforcement learning, policy gradient algorithms are practical and easy to implement, and are widely regarded as the mainstream approach to complex decision-making tasks in continuous spaces. However, these algorithms suffer from high variance in the gradient estimate and unstable policy updates. Policy Gradients with Parameter-based Exploration (PGPE) introduced a deterministic action policy together with random sampling from a prior distribution over the policy parameters, which effectively improved the stability of policy gradient methods in complex environments. Nevertheless, in complex and unknown environments, reinforcement learning algorithms still require large numbers of samples to obtain stable training results. Owing to the particular constraints of physical systems, collecting many interactive samples is very difficult and costly in manpower, material resources, and time. Sample efficiency is therefore a bottleneck in the practical application of reinforcement learning.

To address the poor stability and low sample efficiency of reinforcement learning algorithms in complex continuous spaces, this thesis proposes a Proximal Parameter-based Policy Optimization (PPPO) algorithm. Building on the PGPE framework, PPPO introduces the proximal policy optimization idea together with baseline sampling and symmetric sampling techniques. It reduces unnecessary randomness by adopting a deterministic policy in environments where sampling is limited, and improves the agent's training performance by reusing old samples without increasing the variance of the policy gradient. This alleviates both the instability of training in high-dimensional environments and the problem of low sample utilization.

Finally, the thesis verifies the effectiveness of the PPPO algorithm on robot control experiments in a low-dimensional continuous space, and then applies it to an intelligent robot control task in a high-dimensional space. The experimental results show that the algorithm achieves better convergence quality and overall performance, and mitigates the low sample utilization and high-variance, unstable policy gradient estimates of the reinforcement learning algorithms discussed above.
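The abstract does not give PPPO's exact update equations, but the two ingredients it names are well documented: PGPE's symmetric sampling with a baseline, and PPO's clipped importance ratio for reusing old samples. The Python sketch below is a minimal illustration under our own assumptions, not the thesis's implementation: it updates a Gaussian prior N(mu, sigma^2) over the parameters of a deterministic policy via symmetric sampling with a moving-average baseline, and shows a PPO-style clipped ratio between new and old parameter distributions. The toy objective episode_return, THETA_STAR, the learning rates, and all function names are illustrative.

```python
# Minimal sketch (not the thesis's code): PGPE-style symmetric sampling with a
# baseline, plus a PPO-style clipped importance ratio over the parameter prior.
# episode_return, THETA_STAR, and all step sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
THETA_STAR = np.array([1.0, -2.0, 0.5])   # toy optimum of the policy parameters

def episode_return(theta):
    """Stand-in for one rollout of the deterministic policy pi_theta."""
    return -np.sum((theta - THETA_STAR) ** 2)

def pgpe_symmetric_update(mu, sigma, baseline, lr_mu=0.2, lr_sigma=0.02):
    """One PGPE update of the Gaussian prior N(mu, sigma^2) over theta."""
    eps = rng.normal(0.0, sigma)            # perturbation eps ~ N(0, sigma^2)
    r_plus = episode_return(mu + eps)       # rollout with theta+ = mu + eps
    r_minus = episode_return(mu - eps)      # rollout with theta- = mu - eps
    # Symmetric (antithetic) estimate for mu: the baseline cancels in the difference.
    grad_mu = eps * (r_plus - r_minus) / 2.0
    # Estimate for sigma uses the mean return relative to the running baseline.
    r_mean = (r_plus + r_minus) / 2.0
    grad_sigma = ((eps ** 2 - sigma ** 2) / sigma) * (r_mean - baseline)
    mu = mu + lr_mu * grad_mu
    sigma = np.maximum(sigma + lr_sigma * grad_sigma, 0.05)
    baseline = 0.9 * baseline + 0.1 * r_mean  # moving-average baseline
    return mu, sigma, baseline

def clipped_weight(theta, mu_new, sigma_new, mu_old, sigma_old, clip_eps=0.2):
    """PPO-style clipped importance ratio p_new(theta) / p_old(theta)."""
    def log_prob(th, mu, sigma):
        return np.sum(-0.5 * ((th - mu) / sigma) ** 2 - np.log(sigma))
    ratio = np.exp(log_prob(theta, mu_new, sigma_new)
                   - log_prob(theta, mu_old, sigma_old))
    return np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)

mu, sigma, baseline = np.zeros(3), np.ones(3), 0.0
for _ in range(500):
    mu, sigma, baseline = pgpe_symmetric_update(mu, sigma, baseline)
print("learned mu:", np.round(mu, 2))       # should end up near THETA_STAR
```

In a full PPPO-style method, as we read the abstract, a clipped ratio like clipped_weight would reweight the returns of parameter samples drawn under the old (mu, sigma) when estimating the gradient for the new one, keeping updates proximal while reusing old samples; it is shown here only as a standalone function.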
Keywords/Search Tags: Reinforcement Learning, Policy Gradient, Importance Sampling Technique, Policy Gradients with Parameter-based Exploration, PPO Algorithm