Font Size: a A A

Evaluation Function And Algorithm Design Of Bandit Problem From The Perspective Of Best-individual-search

Posted on:2022-08-10Degree:MasterType:Thesis
Country:ChinaCandidate:J P XiaFull Text:PDF
GTID:2480306740478144Subject:Probability theory and mathematical statistics
Abstract/Summary:PDF Full Text Request
Bandit problem has been a hot research topic in recent decades,and many learning algorithms such as ad recommendation and Monte Carlo tree search use the idea of Bandit problem.The classical Bandit problem studies how to design algorithms to maximize returns when the expected returns of each individuals are completely unknown and only one individual can be chosen at a time.The most commonly used algorithms are Thompson sampling and UCB1.Although researchers have studied various extensions and applications in various scenarios of the Bandit problem,the goal of most of them is to maximize the global return,and the evaluation function is Regret proposed by Lai and Robbins.However,many algorithms such as Alphago hope to find the optimal choice among multiple choices rather than maximizing the global return.These two problems are not equivalent.This article will study the Bandit problem from the perspective of optimal choice.This problem is called the“best-individual-search problem”.We show that Regret is not a valid evaluation function in the perspective of best-individual-search,and propose a valid evaluation function:True Judgement Probability.Specifically,assuming that expected returns of individuals are p1,p2,...,pK,their prior distribution isπ,and the ith individual has the largest expected return,then the True Judgment Probability refers to the posterior probability that piis greater than any other pjunder the conditions given by the algorithm A and the total number of trialsn.The main text will focus on the True Judgement Probability.The True Judgement Probability is also called“True Judgement Probability of Algorithms(TJPA)”,and the corresponding evaluation function of samples is called“True Judgement Probability of Samples(TJPS)”.TJPS is a way to estimate TJPA by experiment.This paper first proves that TJPA reaches its maximum value(also called theoretical optimal solution)under a static algorithm with optimal allocation which only exists in theory,then start to derive methods of finding the theoretical optimal solution.For general distribution families of returns,the constrained optimization problem used to find the theoretical optimal solution is simplified by the large sample property,the expectation formula of which is approximated to the form of cross entropy,thereby the objective function is changed into a more concise K multiple integral,and an iterative algorithm based on the quasi-Newton method is given.Then for the two-point distribution family and the Gaussian distribution family,the objective function of the constrained optimization problem can be approximated as K one-fold integral multiplication,and then a simplified set of equations for solving the constrained optimization problem of these two distributions is derived through Lagrange multiplier method and the approximate calculation of the cumulative distribution function,which doesn’t have any integrals.Numerical simulations show that the solving speed of the equations is tens of thousands times faster than the quasi-Newton method,and the accuracy is almost the same as that of the quasi-Newton method.Applying the equation set method of the theoretical optimal solution,a new algorithm for the best-individual-search Bandit problem is given in the article:Opt-Alloc algorithm.The average values of Opt-Alloc,UCB1,and Thompson sampling are compared in numerical simulations after 500 runs of each of the three algorithms,and the results show that the Opt-Alloc algorithm significantly outperforms the UCB1 algorithm and the Thompson sampling algorithm.
Keywords/Search Tags:Approximation of Beta distribution, Bandit problem, Evaluation function, Opt-Alloc, Theoretical optimal solutions, True Judgement Probability
PDF Full Text Request
Related items