
The Empirical MAB Strategies With Markov Structure

Posted on: 2023-03-24    Degree: Doctor    Type: Dissertation
Country: China    Candidate: X Li    Full Text: PDF
GTID: 1520307139999599    Subject: Statistics
Abstract/Summary:
In recent years, multi-armed bandit (MAB) problems have attracted wide attention because they model many real-world applications, such as online recommendation systems, web search, and crowdsourcing tasks. The key to the MAB problem is resolving the dilemma between exploration and exploitation and striking the right balance between them. Several simple and effective algorithms have been proposed for this purpose, for example the ε-greedy algorithm/strategy, the Upper Confidence Bound (UCB) algorithm/strategy, and Thompson Sampling (TS). Research on MABs can be divided into two branches: the probability-theory branch and the statistics branch. In the probability-theory branch, the reward of each arm typically evolves in a Markovian fashion, or in an even more complex form. The statistics branch, however, has largely remained within the independent and identically distributed reward framework, in which the environmental information (the underlying distributions) is unknown to the decision-maker. This thesis therefore studies strategies for MAB decision processes with Markov structure, a new problem in reinforcement learning whose selection as a topic has important theoretical and practical significance. The main work and innovations of this thesis are as follows:

(1) For more general MAB models in which every arm evolves according to a rewarded Markov process, it is well known that the optimal policy is to pull the arm with the highest Gittins index. When the underlying distributions are unknown, an empirical Gittins index rule (strategy) with ε-exploration (abbreviated as the empirical ε-Gittins index rule) is proposed to solve such MAB problems. The procedure combines ε-exploration (for exploration) with empirical Gittins indices (for exploitation), the latter computed by applying the Largest-Remaining-Index algorithm to the estimated underlying distributions. This thesis establishes the convergence of the empirical Gittins indices to the true Gittins indices, as well as the convergence of the expected total discounted reward (expected return) of the empirical ε-Gittins index rule to that of the oracle Gittins index rule. Finally, a numerical simulation study demonstrates the behavior of the proposed policies, and their performance relative to the ε-mean-reward rule is discussed.

(2) This thesis considers the GUCB-I strategy with a discount factor over an infinite horizon, combining Monte Carlo methods with finite Markov decision processes to construct a general MAB problem. Each policy is treated as an arm in the MAB model, or more precisely as a general arm, and the cumulative discounted reward of the Markov trajectory generated under that policy is viewed as the immediate reward of the arm. This breaks with the traditional perspective of solving an MAB as an MDP, and it resolves the open problem of the convergence of the Monte Carlo Exploring Starts (MCES) algorithm in reinforcement learning, rather than merely providing a partial solution. We propose the Generalized Upper Confidence Bound with Infinite Horizon (GUCB-I) algorithm, a generalized UCB index strategy, and prove that as the number of decision epochs tends to infinity, the upper confidence bound of the optimal policy converges almost surely to the expected reward of the optimal policy; the sample means of the rewards obtained under all policies in the recommended policy set also converge almost surely to their respective mean rewards. In addition, it is proved that the expected regret of the algorithm attains the optimal log n order. Finally, numerical experiments demonstrate the behavior of the GUCB-I algorithm under different confidence levels and initial distributions and compare it with the MCES algorithm in terms of the percentage of optimal actions and the expected regret; the GUCB-I algorithm outperforms the MCES algorithm.

(3) The combination of Monte Carlo methods and Markov decision processes is considered further, and a general MAB problem is constructed over a finite horizon, in which each policy is again regarded as a general arm and the Markov reward under that policy is viewed as the immediate reward of the general arm. The open problem of the convergence of the MCES algorithm in reinforcement learning is likewise resolved. This thesis studies a Generalized Upper Confidence Bound with Finite Horizon (GUCB-F) algorithm, another generalized UCB index strategy, and proves that the upper confidence bound of the optimal policy converges almost surely to the expected reward of the optimal policy as the number of decision epochs tends to infinity, and that the sample means of the rewards under all policies in the recommended policy set converge almost surely to their respective mean rewards. Furthermore, the expected regret of the GUCB-F algorithm achieves logarithmic order. Finally, numerical simulations show the behavior of the proposed strategy under different confidence levels and initial distributions, and its performance is analyzed.

(4) This thesis discusses the Gittins index strategy under the UCB framework and designs, for the first time, a UCB version of the Gittins index strategy, breaking with the traditional research focus on confidence upper bounds for the mean. Observing that the UCB algorithm can both explore uncertainty and select currently promising arms based on historical information, this thesis presents an empirical Gittins index rule with δ-exploration, i.e., it constructs a confidence upper bound for the Gittins index via the upper-confidence-bound approach, referred to as the empirical UCB-Gittins index rule, to solve the MAB problem. We apply the Largest-Remaining-Index algorithm to the confidence upper bound of the mean reward and the estimated underlying distributions to derive a confidence upper bound on the Gittins index. Finally, simulation results show the effectiveness of the corresponding empirical strategy.

By studying the MAB problem under a Markov state-transition structure, this thesis aims to build a bridge between the two branches above, a worthwhile exploration with significant academic value. It combines the classical ε-exploration and UCB methods from the statistical study of MABs with the Gittins-index optimal strategy of Markov-structured MABs, and deeply explores strategies for the MAB problem with Markov arm structure when the distributions are unknown. This opens a new research direction for the MAB problem and has clear novelty.
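As a rough, self-contained sketch (not the thesis's actual code), the two ingredients of the empirical ε-Gittins index rule in (1) can be illustrated as follows: the Largest-Remaining-Index (Varaiya-Walrand-Buyukkoc) recursion that computes Gittins indices of a finite Markov reward chain, and the ε-exploration layer that applies it to estimated chains. All function names and the toy inputs are assumptions made for illustration.

```python
import numpy as np

def gittins_indices(P, r, beta):
    """Gittins indices of a finite Markov reward chain via the
    largest-remaining-index (Varaiya-Walrand-Buyukkoc) recursion.
    P: (n, n) transition matrix, r: (n,) rewards, beta in (0, 1)."""
    n = len(r)
    idx = np.full(n, -np.inf)
    cont = np.zeros(n, dtype=bool)       # continuation set, grown greedily
    top = int(np.argmax(r))              # highest-reward state has top index
    idx[top], cont[top] = r[top], True
    for _ in range(n - 1):
        # d: expected discounted reward, b: expected discounted time, when
        # we keep playing while the chain stays in the continuation set.
        M = np.eye(n) - beta * P * cont  # zeroes columns outside the set
        d = np.linalg.solve(M, r)
        b = np.linalg.solve(M, np.ones(n))
        ratio = np.where(cont, -np.inf, d / b)
        nxt = int(np.argmax(ratio))      # remaining state with largest ratio
        idx[nxt], cont[nxt] = ratio[nxt], True
    return idx

def eps_gittins_pull(est_P, est_r, states, beta=0.9, eps=0.05, rng=None):
    """Empirical ε-Gittins rule: with probability eps explore a uniformly
    random arm; otherwise pull the arm whose current state has the largest
    Gittins index computed from the *estimated* chain (est_P[k], est_r[k])."""
    rng = rng or np.random.default_rng(0)
    if rng.random() < eps:
        return int(rng.integers(len(states)))
    scores = [gittins_indices(est_P[k], est_r[k], beta)[states[k]]
              for k in range(len(states))]
    return int(np.argmax(scores))
```

With eps = 0 the rule is purely greedy on the empirical indices; the small ε keeps every arm's chain visited so the estimated transitions and rewards continue to improve.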
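The "each policy is a general arm" construction behind GUCB-I and GUCB-F in (2) and (3) can be sketched in the same hedged spirit. The two-state MDP, its rewards, and all names below are made up for illustration, and the index used is a standard UCB1-style bonus rather than the thesis's exact confidence bound.

```python
import math
import random

# Toy two-state MDP (illustrative only): action 0 is "safe" (reward 0.4,
# state unchanged); action 1 pays 1.0 in state 1 but 0.0 in state 0, and
# tends to drive the chain toward state 1.
def step(state, action, rng):
    if action == 0:
        return 0.4, state
    reward = 1.0 if state == 1 else 0.0
    nxt = 1 if rng.random() < 0.9 else 0
    return reward, nxt

def rollout(policy, beta=0.9, T=50, rng=None):
    """Cumulative discounted reward of one truncated trajectory under a
    stationary deterministic policy (a tuple: action per state)."""
    rng = rng or random
    s, g, disc = 0, 0.0, 1.0
    for _ in range(T):
        r, s = step(s, policy[s], rng)
        g += disc * r
        disc *= beta
    return g

def gucb(policies, n_rounds=2000, c=2.0, seed=0):
    """Policies-as-arms sketch: pulling an arm runs one trajectory under
    that policy, the trajectory's discounted return is the arm's sample,
    and an ordinary UCB index picks the next policy to roll out."""
    rng = random.Random(seed)
    K = len(policies)
    counts, means = [0] * K, [0.0] * K
    for k in range(K):                   # initialize: play each arm once
        counts[k], means[k] = 1, rollout(policies[k], rng=rng)
    for t in range(K + 1, n_rounds + 1):
        k = max(range(K),
                key=lambda i: means[i] + math.sqrt(c * math.log(t) / counts[i]))
        g = rollout(policies[k], rng=rng)
        counts[k] += 1
        means[k] += (g - means[k]) / counts[k]
    return max(range(K), key=lambda i: means[i])   # recommended policy

# All four deterministic stationary policies of the toy MDP.
policies = [(0, 0), (0, 1), (1, 0), (1, 1)]
best = gucb(policies)
```

Here the learner never treats the MDP as an MDP: a pull is one truncated trajectory, its discounted return is an i.i.d. sample of that general arm's reward, and plain bandit machinery applies.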
Keywords/Search Tags:Multi-armed bandit, Reinforcement learning, Gittins index, Empirical Gittins index, ε-greedy algorithm, Upper Confidence Bound algorithm