Evaluation Function And Algorithm Design Of Bandit Problem From The Perspective Of Best-individual-search

Posted on:2022-08-10

Degree:Master

Type:Thesis

Country:China

Candidate:J P Xia

Full Text:PDF

GTID:2480306740478144

Subject:Probability theory and mathematical statistics

Abstract/Summary:

PDF Full Text Request

Bandit problem has been a hot research topic in recent decades,and many learning algorithms such as ad recommendation and Monte Carlo tree search use the idea of Bandit problem.The classical Bandit problem studies how to design algorithms to maximize returns when the expected returns of each individuals are completely unknown and only one individual can be chosen at a time.The most commonly used algorithms are Thompson sampling and UCB1.Although researchers have studied various extensions and applications in various scenarios of the Bandit problem,the goal of most of them is to maximize the global return,and the evaluation function is Regret proposed by Lai and Robbins.However,many algorithms such as Alphago hope to find the optimal choice among multiple choices rather than maximizing the global return.These two problems are not equivalent.This article will study the Bandit problem from the perspective of optimal choice.This problem is called the“best-individual-search problem”.We show that Regret is not a valid evaluation function in the perspective of best-individual-search,and propose a valid evaluation function:True Judgement Probability.Specifically,assuming that expected returns of individuals are p₁,p₂,...,p_K,their prior distribution isπ,and the ith individual has the largest expected return,then the True Judgment Probability refers to the posterior probability that piis greater than any other pjunder the conditions given by the algorithm A and the total number of trialsn.The main text will focus on the True Judgement Probability.The True Judgement Probability is also called“True Judgement Probability of Algorithms（TJPA）”,and the corresponding evaluation function of samples is called“True Judgement Probability of Samples（TJPS）”.TJPS is a way to estimate TJPA by experiment.This paper first proves that TJPA reaches its maximum value（also called theoretical optimal solution）under a static algorithm with optimal allocation which only exists in theory,then start to derive methods of finding the theoretical optimal solution.For general distribution families of returns,the constrained optimization problem used to find the theoretical optimal solution is simplified by the large sample property,the expectation formula of which is approximated to the form of cross entropy,thereby the objective function is changed into a more concise K multiple integral,and an iterative algorithm based on the quasi-Newton method is given.Then for the two-point distribution family and the Gaussian distribution family,the objective function of the constrained optimization problem can be approximated as K one-fold integral multiplication,and then a simplified set of equations for solving the constrained optimization problem of these two distributions is derived through Lagrange multiplier method and the approximate calculation of the cumulative distribution function,which doesn’t have any integrals.Numerical simulations show that the solving speed of the equations is tens of thousands times faster than the quasi-Newton method,and the accuracy is almost the same as that of the quasi-Newton method.Applying the equation set method of the theoretical optimal solution,a new algorithm for the best-individual-search Bandit problem is given in the article:Opt-Alloc algorithm.The average values of Opt-Alloc,UCB1,and Thompson sampling are compared in numerical simulations after 500 runs of each of the three algorithms,and the results show that the Opt-Alloc algorithm significantly outperforms the UCB1 algorithm and the Thompson sampling algorithm.

Keywords/Search Tags:

Approximation of Beta distribution, Bandit problem, Evaluation function, Opt-Alloc, Theoretical optimal solutions, True Judgement Probability

PDF Full Text Request

Related items

1	Research On Approximation Properties Based On Beta Basic Function And Its Application In Regression Analysis Model
2	Function approximation methods for optimal control problems
3	Least-Squates Solutions Of Inverse Problems For Two Class Of Matrices And Their Optimal Approximation
4	A Kind Of Optimal Stopping Under Probability Distortion
5	Research On Some Optimal Investment Problems Under Nonconcave Utilities And Probability Distortions
6	Improvement And Application Of Grey Model Based On Beta And KPCA
7	Some Matrix Equations Problems Under Special Linear Constraint And Their Optimal Approximation Problems
8	The Least-squares Problem Of Centrosymmetric Matrices And Its Approximation
9	A Note On The Estimate Of The Beta Distribution
10	Reserach On Optimal Dividend Strategy And Ruin Probability Of Insurance Companies With Uncertain Income