On multi-armed bandit in dynamic systems

Posted on:2011-09-27

Degree:Ph.D

Type:Thesis

University:University of California, Davis

Candidate:Liu, Keqin

Full Text:PDF

GTID:2448390002967486

Subject:Applied Mathematics

Abstract/Summary:

PDF Full Text Request

Multi-armed bandit (MAB) is a classical problem in stochastic optimization with a wide range of engineering applications. The first MAB problem was proposed in 1933 for the application of clinical trial. The problem, however, remained open for over 40 years until the breakthrough by Gittins in 1974. Under a Bayesian formulation, Gittins proved that an index policy is optimal, thus reducing the complexity of finding the optimal policy from exponential to linear with the system size. In 1985, Lai and Robbins established the optimal policy for the MAB under a non-Bayesian formulation. Since these milestones, MAB have attracted numerous research efforts on generalizing the associated mathematical theory and broadening the applications. In this thesis, we present our contributions to the basic theories of both the Bayesian and the non-Bayesian frameworks of MAB motivated by engineering problems in dynamic systems.;For the Bayesian framework, we address an important and still largely open extension of the classic MAB---the Restless Multi-Armed Bandit (RMAB). In 1988, Whittle generalized the classic MAB to RMAB that considers the scenario where the system dynamics cannot be directly controlled. This generalization significantly broadens the application area of MAB but renders Gittins index policy suboptimal. As shown by Papadimitriou and Tsitsiklis, finding the optimal solution to an RMAB is PSPACE-hard in general. Whittle proposed a heuristic index policy with linear complexity, which was shown to be asymptotically (when the system size, i.e., the number of arms, approaches infinity) optimal under certain conditions by Weber and Weiss in 1990. The difficulty of implementing Whittle index policy lies in the complexity of establishing its existence (the so-called indexability), computing the index, and establishing its optimality in the finite regime. The study of Whittle index policy often relies on numerical calculation that is infeasible for RMAB with infinite state space. In this thesis, we show that for a significant class of RMAB with an infinite state space, the indexability can be established, Whittle index can be obtained in closed-form, and, under certain conditions, achieves the optimal performance with a simple semi-universal structure that is robust against model mismatch and variations. To our best knowledge, this appears to be the first nontrivial RMAB for which Whittle index policy is proven to be optimal for a finite-size system. This class of RMAB finds a broad range of applications, from dynamic multichannel access in communication networks to bio/chemical monitoring systems, from target tracking/collecting in multi-agent systems to resource-constrained jamming/anti-jamming, from network anomaly detection to supervisory control systems. Furthermore, our approach to establishing the indexability, solving forWhittle index and characterizing its optimality is not limited to this class of RMAB and provides a set of possible techniques for analyzing the general RMAB.;For the non-Bayesian framework, we extend the classic MAB that assumes a single player to the case of multiple distributed players. Players make decisions solely based on their local observation and decision histories without exchanging information. We formulate the problem as a decentralized MAB under general reward, observation, and collision models. We show that the optimal performance (measured by system regret) in the decentralized MAB achieves the same logarithmic order as that in the classic centralized MAB where players act collectively as a single entity by exchanging observations and making decisions jointly. Based on a Time Division and Fair Sharing (TDFS) structure, a general framework of constructing order-optimal and fair decentralized polices is proposed. The generality of the TDFS framework leads to its wide applications to distributed learning problems in multi-channel communication systems, multi-agent systems, web search and internet advertising, social networks, etc.

Keywords/Search Tags:

MAB, Systems, Bandit, Applications, Problem, Index policy, Dynamic, Optimal

PDF Full Text Request

Related items

1	Essays on sequential analysis: Multi-armed bandit with availability constraints and sequential change detection and identification
2	Research On Underwater Data Collection Algorithms In Uasns Based On Rbp System Model
3	Nouiehed-Ross’s Conjecture,Berry’s Conjecture For Two-Armed Bandit And Their Applications
4	Performance guarantee of a sub-optimal policy for a discrete Markov decision process and its application to a robotic surveillance problem
5	Research On Real-time Online Recommendation Method In Recommendation System
6	Research On Optimal Selection Strategy Of Search Engine Keywords Based On Multi-armed Bandit
7	Three Kinds Of Optimal Consensus Control Algorithms For Nonlinear Multi-Agent Systems
8	A Dynamic Pricing Algorithm Of Niche Goods Based On Multi-Armed Bandit Model
9	Two-Armed Bandit Problems And Related Limit Theorems
10	Research On Optimal Control For Nonlinear Systems Based On Reinforcement Learning