
Thompson Sampling for Bandit Problems

Posted on: 2019-03-08
Degree: Ph.D
Type: Thesis
University: Princeton University
Candidate: Liu, Che-Yu
Full Text: PDF
GTID: 2475390017485332
Subject: Artificial Intelligence
Abstract/Summary:
Bandit problems are the most basic examples of sequential decision-making problems with limited feedback and an exploration/exploitation trade-off. In these problems, an agent repeatedly selects an action from a pool of candidates and receives a reward sampled from that action's reward distribution. The agent's goal is to maximize the sum of rewards received over time. At each time step, the trade-off is between exploiting actions that have already produced high rewards and exploring poorly understood actions that might yield even higher rewards in the future. Bandit problems arise naturally in many applications, such as clinical trials, project management, and online news recommendation.

Thompson Sampling is a popular strategy for solving bandit problems. It selects actions according to the "probability matching" principle: each action is chosen with the posterior probability that it is the best one. In the first part of this thesis, we analyze Thompson Sampling from several angles. First, we prove a tight bound on Thompson Sampling's performance when that performance is averaged with respect to the prior distribution that Thompson Sampling takes as input. Next, we turn to the more realistic non-averaged performance of Thompson Sampling and quantify its sensitivity to the choice of input prior by providing matching upper and lower bounds. Finally, we illustrate Thompson Sampling's ability to optimally exploit prior knowledge by thoroughly analyzing its behavior in a non-trivial example.

In the second part of this thesis, we switch our focus to the most-correlated-arms identification problem, in which the actions' reward distributions are assumed to be jointly Gaussian and the goal is to find the actions whose rewards are most mutually correlated. Unlike in bandit problems, here the focus is on exploring the actions to acquire as much relevant information as possible and exploiting that information only at the end, when the set of correlated actions is returned. We propose two adaptive action-selection strategies and show that they can have significant advantages over the non-adaptive uniform sampling strategy. Our proposed algorithms rely on a novel correlation estimator, and the accuracy of this estimator allows us to obtain improved results for a wide range of problem instances.
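As background for the first part, the "probability matching" idea can be illustrated with a minimal sketch of Thompson Sampling for a Bernoulli bandit with independent Beta priors. The arm means, horizon, and Beta(1, 1) priors below are illustrative assumptions, not the specific settings analyzed in the thesis.

```python
import numpy as np

def thompson_sampling(true_means, horizon, rng=None):
    """Beta-Bernoulli Thompson Sampling on a K-armed bandit.

    At each step, draw one sample of each arm's mean from its Beta
    posterior and pull the arm whose sample is largest. This implements
    probability matching: each arm is selected with the posterior
    probability that it is the best arm.
    """
    rng = np.random.default_rng() if rng is None else rng
    k = len(true_means)
    successes = np.ones(k)  # Beta(1, 1) uniform prior on each arm's mean
    failures = np.ones(k)
    total_reward = 0.0
    for _ in range(horizon):
        theta = rng.beta(successes, failures)   # one posterior sample per arm
        arm = int(np.argmax(theta))             # act greedily on the sample
        reward = float(rng.random() < true_means[arm])  # Bernoulli reward
        successes[arm] += reward                # conjugate posterior update
        failures[arm] += 1.0 - reward
        total_reward += reward
    return total_reward

# Example: three arms; pulls concentrate on the arm with mean 0.7.
print(thompson_sampling([0.3, 0.5, 0.7], horizon=2000))
```

Drawing a single posterior sample per arm and acting greedily on it is what makes the selection probabilities match the posterior probabilities of optimality, without ever computing those probabilities explicitly.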
Keywords/Search Tags: Thompson sampling, Bandit problem