Research On Proximal Parameter-based Policy Optimization Algorithm

Posted on: 2023-03-20
Degree: Master
Type: Thesis
Country: China
Candidate: J X Yang
Full Text: PDF
GTID: 2568307058963839
Subject: Control engineering
Abstract/Summary:
Reinforcement learning is an important branch of machine learning that studies how an agent can make better decisions based on the current state of its environment. It is one of the most promising research directions for achieving the goals of artificial intelligence, and a research hotspot among intelligent-system developers. Within reinforcement learning, policy gradient algorithms are practical and easy to implement, and are widely regarded as the mainstream approach to complex decision-making tasks in continuous spaces. However, these algorithms suffer from high variance in the gradient estimate and unstable policy updates. Policy Gradients with Parameter-based Exploration (PGPE) introduced a deterministic action policy together with random sampling from a prior distribution over the policy parameters, which effectively improved the stability of policy gradient methods in complex environments. Nevertheless, in complex and unknown environments, reinforcement learning algorithms still require large numbers of samples to obtain stable training results. Owing to the particular constraints of physical systems, collecting many interactive samples is very difficult and costly in manpower, material resources, and time. Sample efficiency is therefore a bottleneck in the practical application of reinforcement learning.

To address the poor stability and low sample efficiency of reinforcement learning algorithms in complex continuous spaces, this thesis proposes a Proximal Parameter-based Policy Optimization (PPPO) algorithm. Building on the PGPE framework, PPPO introduces the proximal policy optimization idea together with baseline sampling and symmetric sampling techniques. It reduces unnecessary randomness by adopting a deterministic policy in environments where sampling is limited, and improves the agent's training performance by reusing old samples without increasing the variance of the policy gradient. This alleviates both the instability of training in high-dimensional environments and the problem of low sample utilization.

Finally, the thesis verifies the effectiveness of the PPPO algorithm on robot control experiments in a low-dimensional continuous space, and then applies it to an intelligent robot control task in a high-dimensional space. The experimental results show that the algorithm achieves better convergence quality and overall performance, and mitigates the low sample utilization and high-variance, unstable policy gradient estimates of the reinforcement learning algorithms discussed above.
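The abstract does not give PPPO's exact update equations, but the two ingredients it names are well documented: PGPE's symmetric sampling with a baseline, and PPO's clipped importance ratio for reusing old samples. The Python sketch below is a minimal illustration under our own assumptions, not the thesis's implementation: it updates a Gaussian prior N(mu, sigma^2) over the parameters of a deterministic policy via symmetric sampling with a moving-average baseline, and shows a PPO-style clipped ratio between new and old parameter distributions. The toy objective episode_return, THETA_STAR, the learning rates, and all function names are illustrative.

```python
# Minimal sketch (not the thesis's code): PGPE-style symmetric sampling with a
# baseline, plus a PPO-style clipped importance ratio over the parameter prior.
# episode_return, THETA_STAR, and all step sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
THETA_STAR = np.array([1.0, -2.0, 0.5])   # toy optimum of the policy parameters

def episode_return(theta):
    """Stand-in for one rollout of the deterministic policy pi_theta."""
    return -np.sum((theta - THETA_STAR) ** 2)

def pgpe_symmetric_update(mu, sigma, baseline, lr_mu=0.2, lr_sigma=0.02):
    """One PGPE update of the Gaussian prior N(mu, sigma^2) over theta."""
    eps = rng.normal(0.0, sigma)            # perturbation eps ~ N(0, sigma^2)
    r_plus = episode_return(mu + eps)       # rollout with theta+ = mu + eps
    r_minus = episode_return(mu - eps)      # rollout with theta- = mu - eps
    # Symmetric (antithetic) estimate for mu: the baseline cancels in the difference.
    grad_mu = eps * (r_plus - r_minus) / 2.0
    # Estimate for sigma uses the mean return relative to the running baseline.
    r_mean = (r_plus + r_minus) / 2.0
    grad_sigma = ((eps ** 2 - sigma ** 2) / sigma) * (r_mean - baseline)
    mu = mu + lr_mu * grad_mu
    sigma = np.maximum(sigma + lr_sigma * grad_sigma, 0.05)
    baseline = 0.9 * baseline + 0.1 * r_mean  # moving-average baseline
    return mu, sigma, baseline

def clipped_weight(theta, mu_new, sigma_new, mu_old, sigma_old, clip_eps=0.2):
    """PPO-style clipped importance ratio p_new(theta) / p_old(theta)."""
    def log_prob(th, mu, sigma):
        return np.sum(-0.5 * ((th - mu) / sigma) ** 2 - np.log(sigma))
    ratio = np.exp(log_prob(theta, mu_new, sigma_new)
                   - log_prob(theta, mu_old, sigma_old))
    return np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)

mu, sigma, baseline = np.zeros(3), np.ones(3), 0.0
for _ in range(500):
    mu, sigma, baseline = pgpe_symmetric_update(mu, sigma, baseline)
print("learned mu:", np.round(mu, 2))       # should end up near THETA_STAR
```

In a full PPPO-style method, as we read the abstract, a clipped ratio like clipped_weight would reweight the returns of parameter samples drawn under the old (mu, sigma) when estimating the gradient for the new one, keeping updates proximal while reusing old samples; it is shown here only as a standalone function.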
Keywords/Search Tags: Reinforcement Learning, Policy Gradient, Importance Sampling Technique, Policy Gradients with Parameter-based Exploration, PPO Algorithm