Online Reinforcement Learning Study Based On Posterior Sampling

Posted on: 2024-01-16
Degree: Master
Type: Thesis
Country: China
Candidate: S Q Shi
Full Text: PDF
GTID: 2568307079960279
Subject: Computer Science and Technology
Abstract/Summary:
Since the start of the 21st century, computer hardware has made a qualitative leap, which has greatly benefited machine learning methods because they require large amounts of computing resources. Within machine learning, reinforcement learning is particularly favored by researchers for its simple and efficient form of learning. Before the rise of reinforcement learning, it was difficult to obtain an efficient and accurate optimal solution to the policy optimization problem in online control. There are three main difficulties. First, the policy model of the controlled system is unknown and hard to determine. Second, online control systems place high demands on the accuracy of model inputs and outputs, which traditional algorithms often struggle to meet. Third, trial and error is costly, and online control requires the algorithm to explore and analyze the current environment in real time, so the algorithm must strike a good balance between exploration and exploitation.

Models estimated with traditional estimation algorithms often suffer from high bias and high variance, which makes them hard to apply to practical problems. Reinforcement learning, by modeling the environment, can better capture the world model and thus give the policy model a relatively complete prior. Policy optimization methods based on reinforcement learning can match or even far exceed the performance of traditional control algorithms in online control. Theoretically, the policy optimization methods of reinforcement learning also have notable advantages over other algorithms.

To solve more complex real-world policy optimization problems, this thesis proposes a globally optimal policy generation system based on posterior sampling. Given the high cost of trial and error in online control and the difficulty of learning policy models, this thesis uses posterior sampling to balance the weights of exploration and exploitation based on global and instantaneous rewards. The reliability and superiority of the system are verified mathematically, and an improved upper bound on regret is given for general online reinforcement learning. The contributions and innovations of this thesis mainly include:
· We propose the reward weight-based posterior sampling method (RWPSP). This is the first framework that breaks away from traditional transition function-based sampling. We apply posterior sampling to the policy distribution and update the policy distribution based on long-term and short-term rewards.
· We theoretically prove the convergence and superiority of the algorithm. The regret upper bound of the algorithm is O(?), where C/S2<(?). This result is better than the best currently known regret upper bound, (?).
· Empirically, the algorithm matches or even outperforms the current best algorithm in various open-source reinforcement learning environments, especially in environments with high-dimensional spaces.
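
For readers unfamiliar with how posterior sampling balances exploration and exploitation, the general idea can be sketched on a much simpler problem. The example below is not RWPSP and is not taken from the thesis: it is standard Thompson sampling on a Bernoulli multi-armed bandit, with a Beta posterior per arm. The function name `thompson_sampling`, the arm means, the step count, and the seed are all illustrative assumptions.

```python
import random


def thompson_sampling(true_means, steps=2000, seed=0):
    """Illustrative Thompson sampling on a Bernoulli bandit (not RWPSP).

    true_means: hypothetical success probability of each arm.
    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    # Beta(1, 1) prior on each arm's success probability (uniform).
    alpha = [1] * n_arms
    beta = [1] * n_arms
    pulls = [0] * n_arms

    for _ in range(steps):
        # Draw one plausible mean per arm from its current posterior.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        # Act greedily with respect to the sampled means. Uncertain arms
        # sometimes draw high samples, which is what drives exploration.
        arm = max(range(n_arms), key=lambda a: samples[a])
        # Observe a 0/1 reward from the (simulated) environment.
        reward = 1 if rng.random() < true_means[arm] else 0
        # Conjugate Beta-Bernoulli posterior update.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1

    return pulls


pulls = thompson_sampling([0.2, 0.5, 0.8])
print(pulls)
```

As the posteriors concentrate, samples for poor arms rarely exceed those of the best arm, so pulls shift toward exploitation without an explicit exploration schedule. According to the abstract, RWPSP applies the sampling step to a distribution over policies and updates it with long-term and short-term rewards; that mechanism is not reproduced here.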
Keywords/Search Tags: Online Reinforcement Learning, Regret Bound, Posterior Sampling, Policy Optimization