
Data Efficient Optimization Algorithms For Reinforcement Learning

Posted on: 2021-05-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y R Li
Full Text: PDF
GTID: 2518306104488294
Subject: Computer software and theory
Abstract/Summary:
In recent years, deep reinforcement learning has seen tremendous success in applications with huge state spaces, such as computer Go, video games, and robotics. This success rests on the strong function-approximation ability of deep neural networks together with powerful simulators: a simulator with enough computing resources can generate unlimited interaction data between the agent and the environment. However, in many real-world applications, such as recommender systems, logistics, energy management, and real-world robotics, data collection is expensive and infrequent. Sample efficiency is therefore one of the key algorithmic issues in (deep) reinforcement learning for real-life applications.

To address data-scarce scenarios, i.e., settings where the agent is allowed to interact with the environment to collect new data but only at a low frequency, efficient reuse of off-policy data is necessary. However, standard state-of-the-art policy gradient algorithms do not handle off-policy data well, leading to premature convergence and instability. We introduce divergence-augmented policy optimization algorithms for these data-scarce scenarios. The idea is to include a Bregman divergence between the behavior policy that generated the data and the current policy, ensuring small and safe policy updates with off-policy data. The Bregman divergence is computed between the state-action joint distributions of the two policies rather than between the action distributions alone, yielding a divergence-augmentation formulation that encourages deeper exploration.

Our proposed methods stabilize policy optimization when off-policy data are reused, leading to faster convergence to better policies and a significant improvement in data efficiency. In the Arcade Learning Environment (ALE), our algorithm significantly outperforms the state-of-the-art Proximal Policy Optimization (PPO) method. We also provide a theoretical convergence analysis of our off-policy policy optimization method: we give the closed-form solution in the direct-search setting and prove local convergence in the parameterized-optimization setting.
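The core idea above, an off-policy surrogate objective penalized by a divergence between the behavior policy and the current policy, can be sketched as follows. This is a minimal illustration, not the thesis's actual algorithm: the function name and arguments are hypothetical, and a simple sample-based KL estimate on the collected state-action pairs stands in for the Bregman divergence between joint distributions.

```python
import numpy as np

def divergence_augmented_loss(log_pi, log_mu, advantages, beta=1.0):
    """Sketch of a divergence-augmented off-policy surrogate loss.

    log_pi     : log-probs of the sampled actions under the current policy
    log_mu     : log-probs of the same actions under the behavior policy
    advantages : advantage estimates for the sampled state-action pairs
    beta       : weight of the divergence penalty
    """
    ratio = np.exp(log_pi - log_mu)        # importance weights for off-policy data
    surrogate = ratio * advantages         # off-policy policy-gradient surrogate
    # Sample-based KL(mu || pi) estimate over the visited state-action pairs;
    # a larger beta enforces smaller, safer policy updates.
    kl = log_mu - log_pi
    return -(surrogate - beta * kl).mean() # negate: we minimize the loss
```

When the current policy equals the behavior policy, the importance weights are 1 and the penalty vanishes, recovering the on-policy policy-gradient surrogate; as the policies drift apart, the penalty grows and restrains the update, which is the stabilizing effect the abstract describes.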
Keywords/Search Tags:Reinforcement Learning, Policy Optimization, Off-policy, Sample efficiency