
Thompson Sampling for Bandit Problems

Posted on: 2019-03-08
Degree: Ph.D
Type: Thesis
University: Princeton University
Candidate: Liu, Che-Yu
Full Text: PDF
GTID: 2475390017485332
Subject: Artificial Intelligence
Abstract/Summary:
Bandit problems are the most basic examples of sequential decision-making problems with limited feedback and an exploration/exploitation trade-off. In these problems, an agent repeatedly selects an action from a pool of candidates and receives a reward sampled from that action's reward distribution. The agent's goal is to maximize the sum of rewards received over time. At each time step, the trade-off is between exploiting actions that have already produced high rewards and exploring poorly understood actions that might yield even higher rewards in the future. Bandit problems arise naturally in many applications, such as clinical trials, project management, and online news recommendation.

Thompson Sampling is a popular strategy for solving bandit problems. It selects actions according to the "probability matching" principle: each action is chosen with the posterior probability that it is the best one. In the first part of this thesis, we analyze Thompson Sampling from several angles. First, we prove a tight bound on Thompson Sampling's performance when that performance is averaged with respect to the prior distribution that Thompson Sampling takes as input. Next, we turn to the more realistic non-averaged performance of Thompson Sampling and quantify its sensitivity to the choice of input prior by providing matching upper and lower bounds. Finally, we illustrate Thompson Sampling's ability to optimally exploit prior knowledge by thoroughly analyzing its behavior in a non-trivial example.

In the second part of this thesis, we switch our focus to the most-correlated-arms identification problem, in which the actions' reward distributions are assumed to be jointly Gaussian and the goal is to find the actions whose rewards are most mutually correlated. Unlike in bandit problems, here the focus is on exploring the actions to acquire as much relevant information as possible and exploiting that information only at the end, when the set of correlated actions is returned. We propose two adaptive action-selection strategies and show that they can have significant advantages over the non-adaptive uniform sampling strategy. Our proposed algorithms rely on a novel correlation estimator, and the accuracy of this estimator allows us to obtain improved results for a wide range of problem instances.
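As background for the first part, the "probability matching" idea can be illustrated with a minimal sketch of Thompson Sampling for a Bernoulli bandit with independent Beta priors. The arm means, horizon, and Beta(1, 1) priors below are illustrative assumptions, not the specific settings analyzed in the thesis.

```python
import numpy as np

def thompson_sampling(true_means, horizon, rng=None):
    """Beta-Bernoulli Thompson Sampling on a K-armed bandit.

    At each step, draw one sample of each arm's mean from its Beta
    posterior and pull the arm whose sample is largest. This implements
    probability matching: each arm is selected with the posterior
    probability that it is the best arm.
    """
    rng = np.random.default_rng() if rng is None else rng
    k = len(true_means)
    successes = np.ones(k)  # Beta(1, 1) uniform prior on each arm's mean
    failures = np.ones(k)
    total_reward = 0.0
    for _ in range(horizon):
        theta = rng.beta(successes, failures)   # one posterior sample per arm
        arm = int(np.argmax(theta))             # act greedily on the sample
        reward = float(rng.random() < true_means[arm])  # Bernoulli reward
        successes[arm] += reward                # conjugate posterior update
        failures[arm] += 1.0 - reward
        total_reward += reward
    return total_reward

# Example: three arms; pulls concentrate on the arm with mean 0.7.
print(thompson_sampling([0.3, 0.5, 0.7], horizon=2000))
```

Drawing a single posterior sample per arm and acting greedily on it is what makes the selection probabilities match the posterior probabilities of optimality, without ever computing those probabilities explicitly.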
Keywords/Search Tags: Thompson sampling, Bandit problem