
Decision-Theoretic Planning For Multi-Agent Systems

Posted on: 2012-11-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Wu
Full Text: PDF
GTID: 1118330335462371
Subject: Computer application technology
Abstract/Summary:
Planning under uncertainty is one of the fundamental challenges in artificial intelligence. Decision theory offers a mathematical framework for optimizing decisions in these domains. In recent years, researchers have made rapid progress on decision making in single-agent settings, which enables relatively large problems to be solved by state-of-the-art MDP and POMDP algorithms. However, research on decentralized decision making is still in its infancy, and existing solutions can only handle very small "toy" problems. As a natural extension of Markov decision theory to multi-agent systems, the DEC-POMDP model has very high computational complexity, namely NEXP-hard. This is not surprising, given that agents in multi-agent settings must keep track not only of the state transitions of the environment but also of the potential behaviors of the other agents. As a result, the joint policy space is huge, and how to search this space for the best joint policy is a key issue when solving a large DEC-POMDP. Due to this complexity, exact algorithms can solve only tiny problems. Therefore, my thesis work focuses on developing effective algorithms for solving general DEC-POMDPs approximately. Generally, planning in multi-agent systems can work in either an online or an offline fashion. This thesis contributes to the literature by proposing both online and offline algorithms. Furthermore, a model-free algorithm is presented for planning when the exact model is not available.

Online algorithms plan while interacting with the environment. During execution, only a very small region of the joint policy space is visited, so online algorithms can scale better to large real-world applications. However, online algorithms have their own constraints. The time available for planning is very limited, because agents have to react to the environment immediately. In DEC-POMDP settings, each agent receives only its own local observations, so a distributed planning framework is required to guarantee coordination among agents. In order to cooperate with others, agents must reason about all possible information held by the other agents, and this information grows exponentially over time. Communication resources are often limited by bandwidth, the environment, and the computing devices, so online algorithms must optimize the use of communication during planning. In this thesis, a novel approach called MAOP-COMM is proposed to handle these challenges.

Offline algorithms compute a complete plan prior to interacting with the environment. The key advantages are that there is no limit on planning time and that planning can take place in a centralized manner, as long as the resulting policy can be executed by each agent according to its local information. Currently, the leading approaches for solving DEC-POMDPs offline combine dynamic programming with heuristic search to construct policies. The bottleneck is that the policy trees built at each step grow exponentially, so these algorithms run out of time and memory very quickly. To address this challenge, this thesis improves on existing work and proposes two novel solutions called PBPG and TBDP. The contribution of PBPG lies in constructing the best policy for each belief point directly, instead of enumerating all candidates and selecting the best one. TBDP is developed to address the challenge of the large state spaces in DEC-POMDPs. Its main contribution is to use trial-based evaluation for reachable states and to perform computation only when necessary.
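To see why the exhaustive dynamic-programming backup is the bottleneck, the following minimal sketch (an illustration added here, not code or numbers from the thesis) counts the candidate policy trees produced by a full backup; the example sizes and function names are assumptions for illustration only.

```python
# Sketch of the exhaustive-backup blowup in DEC-POMDPs: after each backup, one
# agent's candidate trees number |A_i| * |Q_i|^|Omega_i| (Q_i = trees kept from
# the previous step), and the joint space is the cross-product over agents.

def backup_tree_count(num_trees_prev: int, num_actions: int, num_obs: int) -> int:
    """Number of candidate policy trees for one agent after an exhaustive backup."""
    return num_actions * num_trees_prev ** num_obs

def joint_tree_count(per_agent_counts: list) -> int:
    """Size of the joint policy space: product of the per-agent tree counts."""
    total = 1
    for count in per_agent_counts:
        total *= count
    return total

if __name__ == "__main__":
    # Two agents, 2 actions and 2 observations each, starting from 2 trees per agent.
    trees = [2, 2]
    for step in range(1, 5):
        trees = [backup_tree_count(t, num_actions=2, num_obs=2) for t in trees]
        print(f"step {step}: per-agent trees = {trees[0]}, joint = {joint_tree_count(trees)}")
    # The per-agent count squares at every step; point-based policy generation
    # avoids this enumeration by building one policy directly per belief point.
```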
Both offline and online algorithms require complete knowledge of the model. Unfortunately, the exact form of a DEC-POMDP is not always available. Therefore, it is important to develop a learning algorithm that can compute a policy by only interacting with the environment. Given this motivation, a Monte-Carlo algorithm called DecRSPI is proposed to learn a policy using rollout samples drawn from the environment. DecRSPI is model-free and requires only a simulator or an environment that can be sampled.

The contribution of this thesis to the multi-agent planning community is mainly fourfold. (1) It systematically studied the multi-agent online planning problem and proposed the MAOP-COMM algorithm, which guarantees coordination among agents. MAOP-COMM has three key components: a fast policy search method based on linear programming to meet the online time constraint; a policy-based history merging solution that identifies the most useful information while bounding memory usage; and a new communication strategy that detects inconsistency among beliefs to make better use of the communication resources. In the experiments, MAOP-COMM performed very well in a variety of testing domains. (2) It systematically studied the policy generation problem in multi-agent offline planning and proposed the PBPG algorithm. PBPG completely replaces the backup operation and formulates policy generation as an optimization problem of finding the best mapping. An approximate algorithm was also proposed to find this mapping efficiently. Consequently, more policy trees can be kept as building blocks for the next iteration, and the overall solution quality is greatly improved. In the experiments, PBPG achieved an order of magnitude improvement in runtime and obtained near-optimal solutions when the number of sub-policies was sufficiently large. (3) It systematically studied the policy evaluation problem in multi-agent offline planning and proposed the TBDP algorithm. TBDP uses trial-based evaluation for reachable states and performs computation only when necessary. A new policy representation with a layered structure and stochastic decision nodes was introduced; it formulates policy construction as an optimization over the parameter space and speeds up the policy generation process. Moreover, TBDP can be implemented in a distributed manner and can exploit the computational power of multi-core computers. In the experiments, TBDP solved a problem with more than ten thousand states. (4) It introduced model-free techniques to multi-agent planning and proposed the DecRSPI algorithm. DecRSPI is a Monte-Carlo algorithm that requires only a simulated environment and learns a policy using rollout samples drawn from the simulator. An important property of DecRSPI is its linear time and space complexity in the number of agents. In the experiments, DecRSPI solved problems with up to 20 agents, an order of magnitude improvement in the number of agents over state-of-the-art algorithms.
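The rollout-sampling idea behind model-free evaluation can be illustrated with the minimal sketch below: it estimates the value of a fixed joint policy purely from episodes drawn from a black-box simulator, with no access to transition or observation probabilities. The simulator interface (reset/step), the toy problem, and all names here are assumptions made for illustration; this is not the DecRSPI implementation itself.

```python
import random

def rollout_value(simulator, joint_policy, horizon, num_rollouts=200):
    """Monte-Carlo estimate of the expected cumulative reward of joint_policy."""
    total = 0.0
    for _ in range(num_rollouts):
        simulator.reset()
        histories = [[] for _ in joint_policy]   # each agent sees only its own history
        episode_return = 0.0
        for _ in range(horizon):
            joint_action = [pi(h) for pi, h in zip(joint_policy, histories)]
            reward, joint_obs = simulator.step(joint_action)
            episode_return += reward
            for h, o in zip(histories, joint_obs):
                h.append(o)
        total += episode_return
    return total / num_rollouts

class ToyBroadcastSim:
    """Toy 2-agent simulator: reward 1 only when exactly one agent sends (action 1)."""
    def reset(self):
        pass
    def step(self, joint_action):
        reward = 1.0 if sum(joint_action) == 1 else 0.0
        joint_obs = [random.randint(0, 1) for _ in joint_action]  # noisy local signals
        return reward, joint_obs

if __name__ == "__main__":
    policy = [lambda h: 1, lambda h: 0]          # agent 0 always sends, agent 1 never
    print(rollout_value(ToyBroadcastSim(), policy, horizon=10))
```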
Keywords/Search Tags: Multi-Agent Systems, Markov Decision Model, Planning Under Uncertainty, Coordination and Cooperation, Decentralized Partially-Observable Markov Decision Process