
Dynamic Allocation of Non-Standard Multi-Armed Bandits

Posted on: 2017-02-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Q Bao
Full Text: PDF
GTID: 1220330485972984
Subject: Probability theory and mathematical statistics
Abstract/Summary:
The main purpose of this paper is to extend multi-armed bandit (MAB) models with index policies, making them more realistic: each arm may have its own restricted switching times; each arm may have its own discount process; and breakdowns may be of the preemptive-repeat type with incomplete information. To this end, we first develop the related theory of optimal stopping problems and nonparametric Bayesian methods. In view of the above purpose, the details are as follows.

To discuss optimal stopping for a regular family of variables indexed by partially available stopping times, we use classical probability theory to obtain general conclusions that subsume the classical continuous-time, discrete-time, and semi-Markov settings as special cases. Firstly, for a family of variables with a single stopping-time index, we introduce the set of allowable stopping times to establish a restricted optimal stopping model, characterize the two families of value functions, establish necessary and sufficient conditions for the existence of optimal stopping times, characterize the minimal and maximal optimal stopping times, and discuss properties of the local family of value variables, such as regularity. Secondly, the restricted optimal stopping problem is extended to the case of double indices, and the results can be generalized to the multi-index case. Thirdly, as a by-product, we derive properties of the countable decomposition of an accessible set.

For the dynamic allocation of a continuous-time MAB, we treat the case in which each of the stochastically independent arms is associated with a random time set, and switching from one arm to another is allowed only when the processing time of the current arm belongs to that restriction set; the objective is to maximize the total expected present value of the bandit system.
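The restricted stopping idea above can be illustrated in a much simpler setting than the dissertation's continuous-time framework. The following minimal sketch (the function name `restricted_snell` and the binomial-walk example are illustrative assumptions, not the author's construction) computes the Snell envelope by backward induction on a binomial random-walk lattice, where stopping is permitted only at times in a prescribed set `allowed`:

```python
def restricted_snell(payoff, p, T, allowed):
    """Value of optimally stopping a binomial random walk by horizon T when
    stopping is permitted only at times in `allowed`.

    payoff(t, s): reward for stopping at time t with walk level s.
    p: probability of an up-step.  T must lie in `allowed` so that at least
    one stopping time exists.
    """
    assert T in allowed, "the horizon must be an allowed stopping time"
    # Terminal condition: at time T we must stop.
    V = {s: payoff(T, s) for s in range(-T, T + 1, 2)}
    for t in range(T - 1, -1, -1):
        # Continuation value: one-step conditional expectation of V.
        cont = {s: p * V[s + 1] + (1 - p) * V[s - 1]
                for s in range(-t, t + 1, 2)}
        if t in allowed:
            # Snell envelope step: stop if the immediate reward beats continuing.
            V = {s: max(payoff(t, s), cont[s]) for s in cont}
        else:
            # Stopping forbidden here: the envelope is just the continuation value.
            V = cont
    return V[0]
```

Shrinking `allowed` can only shrink the value, which mirrors the comparison between classical and restricted optimal stopping in the text.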
By introducing the restricted time set, the classical theory of optimal stopping for stochastic processes is first generalized to allow restricted stopping times. Following the ideas of Kaspi and Mandelbaum (1998), the results are then applied to deduce the relation between the Gittins index process of a single arm with restricted stopping times and its instantaneous reward-rate process. Finally, the optimality of a Gittins index policy is verified using the excursion method of Kaspi and Mandelbaum (1998). New techniques are also introduced, so that the new proof is considerably simpler than those in the literature.

Further, we consider the extended model in which each arm has a non-uniform discount process. After selecting two kinds of expected total discounted reward, we define appropriate indices via the theory of restricted optimal stopping in continuous time, and show that one of the index policies maximizes its objective function while the other does not.

For stochastic scheduling subject to preemptive-repeat breakdowns with incomplete information, using Bayesian methods and taking the expected discounted reward (EDR) as the objective function, we characterize the optimal index strategies under static and dynamic policies respectively, in particular the one-step reward rates of the dynamic policy under different Bayesian frameworks. For the static policy, the results in the general framework are similar to those in the parametric framework; for the dynamic policy, by analyzing the relationship between one-step rewards and the Bayesian framework, we find that different Bayesian frameworks have different impacts on the Gittins index.
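To make the Gittins index and the retirement method (listed in the keywords) concrete, here is a hedged discrete-time sketch for the textbook Bayesian Bernoulli bandit — not the dissertation's continuous-time restricted setting. It uses Whittle's retirement formulation: find the retirement reward M at which the player is indifferent between retiring and continuing; the index is then (1 − β)M. The function name `gittins_index_beta` and the horizon-truncation approximation are assumptions of this sketch:

```python
def gittins_index_beta(a0, b0, beta=0.9, horizon=50, tol=1e-3):
    """Approximate Gittins index of a Bernoulli arm with Beta(a0, b0) posterior
    and discount factor beta, by bisection on the retirement reward.

    horizon truncates the dynamic program, so the result is an approximation.
    """
    def root_value(M):
        # Value of the arm with retirement option M, by memoized recursion
        # over posterior states (a, b); depth = pulls taken so far.
        memo = {}
        def V(a, b):
            if (a - a0) + (b - b0) >= horizon:
                return M  # truncation: forced retirement
            if (a, b) not in memo:
                p = a / (a + b)  # posterior mean of the success probability
                cont = (p * (1 + beta * V(a + 1, b))
                        + (1 - p) * beta * V(a, b + 1))
                memo[(a, b)] = max(M, cont)  # retire or continue
            return memo[(a, b)]
        return V(a0, b0)

    lo, hi = 0.0, 1.0  # the index of a Bernoulli arm lies in [0, 1]
    while hi - lo > tol:
        lam = (lo + hi) / 2
        M = lam / (1 - beta)  # retirement reward equivalent to rate lam forever
        if root_value(M) > M + 1e-12:
            lo = lam  # continuing strictly beats retiring: index exceeds lam
        else:
            hi = lam
    return (lo + hi) / 2
```

The index exceeds the posterior mean a0/(a0 + b0) — the gap is the exploration bonus — which is the discrete-time shadow of the index/reward-rate relation discussed above.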
Keywords/Search Tags: Optimal stopping times, restricted stopping times, Snell's envelope, multi-armed bandit process, Gittins index, retirement method, excursion theory, Bayesian method