
Some New Algorithms Of Reinforcement Learning And Their Theoretical Study

Posted on: 2007-06-16    Degree: Doctor    Type: Dissertation
Country: China    Candidate: C M Yin
GTID: 1100360218960547    Subject: Operational Research and Cybernetics
Abstract/Summary:
This dissertation consists of two parts. In the first part we present several new reinforcement learning algorithms, motivated by difficulties that existing algorithms face, such as the curse of dimensionality and limited computational speed. In the second part we investigate theoretical questions concerning the optimality equations and optimal solutions of reinforcement learning under the notion of risk sensitivity.

First, a new reinforcement learning (RL) algorithm, which we call the forgetting algorithm, is proposed. The idea rests on the following observation: in existing RL algorithms, how far updates to the state value function reach is determined only by a temporal record of state visits. With such methods the system must remember the value of every state, whether look-up tables or function approximators are used, so memory requirements grow exponentially and computational speed degrades accordingly. Motivated by this, we introduce principles of forgetting from the psychology of memory into value-function reinforcement learning, in particular the SARSA(λ) algorithm, and establish a class of forgetting algorithms suited to value-function-based RL.

We then propose a new reinforcement learning algorithm based on utility clustering, called U-Clustering, whose model borrows several notions from partially observable Markov decision processes (POMDPs). The main weaknesses of the U-tree algorithm are the randomness involved in creating fringe nodes and the computational cost of the statistical tests, and existing extensions have not resolved them. U-Clustering needs neither fringe-node creation nor statistical testing: it first clusters the instances according to the observed action values along the instance chain, then selects features for each cluster, and finally compresses those features; the compressed features become the new leaf nodes.

This dissertation also contributes a dynamic merge algorithm for solving partially observable Markov decision processes. The method introduces the notion of a region and builds a region system on the state space of the environment. The agent pursues its optimal goal independently on each region of the region system, much as several agents working in parallel would, and the optimal value functions of the regions are then merged to yield the optimal solution of the POMDP. We analyse the complexity of the new algorithm and evaluate it in simulation on the well-known New York Driving task; the results show that the dynamic merge algorithm is effective for problems with large state spaces.
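As a rough illustration of the region idea only, and not the dissertation's actual algorithm, the sketch below solves a small, fully observable MDP region by region and then merges the regional value functions; the region structure, the merge rule, and all function names are assumptions made for this example.

```python
def value_iteration(P, R, region, gamma=0.95, sweeps=200):
    # Standard value iteration restricted to one region of the state space.
    # P[s][a] maps successor states to probabilities, R[s][a] is the reward;
    # transitions leaving the region are treated as absorbing with value 0.
    V = {s: 0.0 for s in region}
    for _ in range(sweeps):
        for s in region:
            V[s] = max(
                sum(p * (R[s][a] + gamma * V.get(s2, 0.0))
                    for s2, p in P[s][a].items())
                for a in P[s]
            )
    return V

def dynamic_merge(P, R, regions, gamma=0.95):
    # Solve every region independently (the part that could run in parallel,
    # one "agent" per region), then merge: for each state keep the best value
    # found among the regions that contain it.
    merged = {}
    for region in regions:
        V = value_iteration(P, R, region, gamma)
        for s, v in V.items():
            merged[s] = max(v, merged.get(s, float("-inf")))
    return merged
```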
The dissertation further proposes a generalized average algorithm for reinforcement learning agents, based on the notion of risk-sensitive Markov decision processes. The motivation is to gain robustness for a class of reinforcement learning algorithms, potentially at the price of some optimality. Robustness becomes a central concern when the theoretical model does not match the physical system, or when the effectiveness of control actions changes over time. We therefore study the dynamic programming algorithms of reinforcement learning with the maximum (or supremum) operator replaced by a generalized average operator, and we discuss their convergence; the goal is to improve the robustness of this class of algorithms.

We also contribute a new idea, the risk-sensitive evolutionary policy iteration algorithm, for solving reinforcement learning problems, and we discuss the optimality of the policies it produces. The idea is prompted by the curse of dimensionality: in Markov decision processes it is impractical to improve the policy with ordinary policy iteration or value iteration when the action space is very large and the state space comparatively small, which is precisely the setting we address. During policy improvement the algorithm works on policies directly, through policy transitions, instead of maximizing value functions over the whole action space.

Finally, we investigate the optimality equations of multi-time-scale risk-sensitive Markov decision processes. Such models matter increasingly for large-scale practical control problems, where the agent must cope with more complex environments. We adopt a more complex and realistic environment, the multi-time-scale Markov decision process model, introduce the notion of risk sensitivity into it, and thereby propose multi-time-scale risk-sensitive Markov decision processes for the first time, a new class of problems for reinforcement learning algorithms based on Markov models. Using utility functions and risk sensitivity, we study the two-time-scale risk-sensitive Markov decision control problem, derive the formal Bellman optimality equation for two-time-scale risk-sensitive Markov decision processes, and prove that the optimal value function satisfies this equation. Several related results and their proofs are also given.
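To make the operator substitution concrete, here is a minimal sketch, our own illustration rather than the dissertation's algorithm, of a value-iteration backup in which the maximization over actions is replaced by a generalized average. The power-mean form of the average and the parameter p are assumptions chosen for the example: at p = 1 it is the arithmetic mean, and as p grows it approaches the maximum, so p interpolates between a cautious and a greedy backup.

```python
import numpy as np

def generalized_average(values, p):
    # Power mean of order p over positive values: the arithmetic mean at
    # p = 1, approaching the maximum as p -> infinity.
    v = np.asarray(values, dtype=float)
    return float(np.mean(v ** p) ** (1.0 / p))

def backup(P, R, V, s, gamma=0.95, p=10.0):
    # One backup at state s: instead of max_a Q(s, a), aggregate the action
    # values with a generalized average, trading a little optimality for
    # robustness to model mismatch. Rewards are assumed nonnegative so that
    # the power mean is well defined.
    q_values = [
        sum(prob * (R[s][a] + gamma * V[s2]) for s2, prob in P[s][a].items())
        for a in P[s]
    ]
    return generalized_average(q_values, p)
```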
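For context only, the standard one-time-scale exponential-utility form of the risk-sensitive backup is shown below; it is not the dissertation's two-time-scale equation, and the symbols (risk parameter \gamma, discount \beta, reward r, transition kernel P) follow common usage rather than the dissertation's notation.

\[
(T_{\gamma} V)(s) \;=\; \max_{a}\; \frac{1}{\gamma}\,
\log \sum_{s'} P(s' \mid s, a)\,
\exp\!\Big( \gamma \big[\, r(s,a,s') + \beta V(s') \big] \Big),
\qquad \gamma \neq 0 ,
\]

which recovers the usual expected (risk-neutral) Bellman backup in the limit \gamma \to 0; positive \gamma makes the agent risk-seeking and negative \gamma risk-averse.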
Keywords/Search Tags: Reinforcement Learning, Markov Decision Processes, Algorithms, Risk-Sensitive, Optimal Policy