
Data Efficient Optimization Algorithms For Reinforcement Learning

Posted on: 2021-05-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y R Li
Full Text: PDF
GTID: 2518306104488294
Subject: Computer software and theory
Abstract/Summary:
In recent years, deep reinforcement learning has seen tremendous success in applications with huge state spaces, such as computer Go, video games, and robotics. This success rests on the strong function-approximation ability of deep neural networks together with powerful simulators: a simulator with enough computing resources can generate unlimited interaction data between the agent and the environment. However, in many real-world applications, such as recommender systems, logistics, energy management, and real-world robotics, data collection is expensive and infrequent. Sample efficiency is therefore one of the key algorithmic issues in (deep) reinforcement learning for real-life applications.

To address data-scarce scenarios, i.e., settings where the agent is allowed to interact with the environment to collect new data but only at a low frequency, efficient reuse of off-policy data is necessary. However, standard state-of-the-art policy gradient algorithms do not handle off-policy data well, leading to premature convergence and instability. We introduce divergence-augmented policy optimization algorithms for these data-scarce scenarios. The idea is to include a Bregman divergence between the behavior policy that generated the data and the current policy, ensuring small and safe policy updates with off-policy data. The Bregman divergence is computed between the state-action joint distributions of the two policies rather than between the action distributions alone, yielding a divergence-augmentation formulation that encourages deeper exploration.

Our proposed methods stabilize policy optimization when off-policy data are reused, leading to faster convergence to better policies and a significant improvement in data efficiency. In the Arcade Learning Environment (ALE), our algorithm significantly outperforms the state-of-the-art Proximal Policy Optimization (PPO) method. We also provide a theoretical convergence analysis of our off-policy policy optimization method: we give the closed-form solution in the direct-search setting and prove local convergence in the parameterized-optimization setting.
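The core idea above, an off-policy surrogate objective penalized by a divergence between the behavior policy and the current policy, can be sketched as follows. This is a minimal illustration, not the thesis's actual algorithm: the function name and arguments are hypothetical, and a simple sample-based KL estimate on the collected state-action pairs stands in for the Bregman divergence between joint distributions.

```python
import numpy as np

def divergence_augmented_loss(log_pi, log_mu, advantages, beta=1.0):
    """Sketch of a divergence-augmented off-policy surrogate loss.

    log_pi     : log-probs of the sampled actions under the current policy
    log_mu     : log-probs of the same actions under the behavior policy
    advantages : advantage estimates for the sampled state-action pairs
    beta       : weight of the divergence penalty
    """
    ratio = np.exp(log_pi - log_mu)        # importance weights for off-policy data
    surrogate = ratio * advantages         # off-policy policy-gradient surrogate
    # Sample-based KL(mu || pi) estimate over the visited state-action pairs;
    # a larger beta enforces smaller, safer policy updates.
    kl = log_mu - log_pi
    return -(surrogate - beta * kl).mean() # negate: we minimize the loss
```

When the current policy equals the behavior policy, the importance weights are 1 and the penalty vanishes, recovering the on-policy policy-gradient surrogate; as the policies drift apart, the penalty grows and restrains the update, which is the stabilizing effect the abstract describes.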
Keywords/Search Tags:Reinforcement Learning, Policy Optimization, Off-policy, Sample efficiency