
Reinforcement Learning Of Continuous-time Markov Decision Processes With Applications In UAV Control

Posted on: 2016-11-23    Degree: Doctor    Type: Dissertation
Country: China    Candidate: S D Jia
GTID: 1312330536967198    Subject: Control Science and Engineering
Abstract/Summary:
To operate in severe weather and dynamic battlefield environments, unmanned aerial vehicles (UAVs) must cope with strong uncertainties, nonlinearities, multiple-input multiple-output (MIMO) coupled dynamics, and unstructured surroundings. In this thesis, taking the perspective of artificial intelligence, we develop a theory of continuous-time Markov decision processes (CTMDPs) adapted to the requirements of reinforcement learning (RL), and apply it to the control of UAVs. The work builds on the theory of conventional Markov decision processes (MDPs), stochastic optimization, and RL, with a focus on the policy iteration framework and on RL methods for potential-performance based CTMDPs. We also describe successful applications of RL to controller design for autonomous UAV flight. The main contributions are as follows.

1. Uncertainties in the UAV control problem are modeled by CTMDPs, and a potential-performance based CTMDP model is set up. 1) MDPs provide a probabilistic framework for uncertainty, in which transitions between states are stochastic. To handle UAV and environment parameters that vary over time in unstructured, dynamic environments, the MDP framework is replaced by CTMDPs, whose transition times are continuous. On the well-known "two identical cars game" benchmark, the CTMDP framework shows better performance. 2) Solving a CTMDP requires detailed parameter values, which are hard to obtain for two reasons. First, because the parameters are time varying, the transition probabilities and the infinitesimal generator of the transition function are difficult to identify. Second, the lack of a simple relationship between the parameters and sampled trajectories makes the parameters hard to estimate from samples. We therefore construct a CTMDP model based on performance potentials.

2. A potential-performance based policy iteration algorithm is developed for CTMDPs. 1) For potential-performance based CTMDPs with a long-run average reward, we establish several useful results: a lemma on basic policies, a necessary and sufficient condition for policy optimality, and a policy iteration algorithm. 2) The algorithm is proven to converge, and its solution is proven optimal. 3) The analysis shows that conventional MDPs are a special case of CTMDPs with an identity transition rate matrix. 4) To verify the method, a highly dynamic benchmark problem is solved; the CTMDP method yields a discretized solution close to the analytical solution obtained by the differential game method, and it is markedly more robust to changes in the transition probabilities than the conventional MDP method.
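As a concrete illustration of the evaluation-improvement loop in contribution 2, here is a minimal numerical sketch of potential-based policy iteration for a finite, ergodic, average-reward CTMDP with known transition rate matrices. The names (solve_poisson, Q_sa, r_sa) and the NumPy formulation are our assumptions for illustration, not the thesis's implementation: the potential g and average reward eta come from the Poisson equation Q_pi g = eta*1 - r_pi, and the policy is improved greedily against g.

```python
import numpy as np

def solve_poisson(Q, r):
    """Solve the Poisson equation Q g = eta*1 - r for the potential
    vector g and the average reward eta, normalizing g[0] = 0."""
    n = Q.shape[0]
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = Q            # generator acting on g
    A[:n, n] = -1.0          # minus eta in every state equation
    A[n, 0] = 1.0            # extra row enforcing g[0] = 0
    b = np.concatenate([-r, [0.0]])
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[:n], x[n]

def policy_iteration(Q_sa, r_sa, max_iter=100):
    """Potential-based policy iteration for an average-reward CTMDP.
    Q_sa: (n_actions, n_states, n_states) transition rate matrices,
          each row summing to zero; r_sa: (n_states, n_actions) reward rates."""
    n_states, n_actions = r_sa.shape
    policy = np.zeros(n_states, dtype=int)
    for _ in range(max_iter):
        # Evaluation: generator and reward rate under the current policy.
        Q_pi = Q_sa[policy, np.arange(n_states), :]
        r_pi = r_sa[np.arange(n_states), policy]
        g, eta = solve_poisson(Q_pi, r_pi)
        # Improvement: greedy in r(s, a) + sum_j q(s, j; a) g(j).
        scores = r_sa + np.einsum('asj,j->sa', Q_sa, g)
        new_policy = scores.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break            # fixed point reached: policy unchanged
        policy = new_policy
    return policy, g, eta
```

The normalization g[0] = 0 reflects that potentials are determined only up to an additive constant; terminating when the policy no longer changes mirrors the convergence result stated above.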
3. A potential-performance based reinforcement learning algorithm is proposed for CTMDPs. 1) Estimators for the CTMDP parameters are first given, including the transition rate matrix and the transition probabilities of the embedded Markov chain. To evaluate the potential-performance function, we give both an L-step algorithm suited to offline learning and a temporal difference (TD) algorithm suited to online learning; the convergence of these potential evaluation algorithms is proven in theory and demonstrated by simulation. 2) We present a potential-performance based reinforcement learning algorithm for CTMDPs, called CTMDPs-RL, and discuss how to avoid local optima. 3) A standard RL test problem, the inverted pendulum, is used to compare our method with classical ones. CTMDPs-RL solves the problem faster and is less likely to be trapped in local optima than Q-learning, Actor-Critic, GENITOR, SANE, and MDP-based RL methods.

4. The CTMDPs-RL method is employed to design controllers for UAV flight. 1) We consider two types of control problems arising in UAV guidance missions, whose performance criteria are a long-time accumulated reward and a terminal-point cost, respectively, and give a unified framework covering both. 2) We apply CTMDPs-RL to two accumulated-criterion problems, "up and down" and "S-turn" path following, and to two terminal-criterion problems, fixed-height and fixed-velocity control. The results of these examples show that, through learning, CTMDPs-RL quickly finds good control policies for UAVs without prior knowledge of the UAV dynamics.
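Returning to the potential evaluation step of contribution 3, the sketch below illustrates one plausible model-free scheme along a sampled path of the embedded Markov chain: exit rates and embedded-chain transition probabilities are estimated from sojourn times and jump counts, and the potentials are updated online by a temporal-difference rule. The update form, step sizes, and names (estimate_rates, td_potentials, alpha, beta) are our assumptions and may differ from the thesis's exact algorithms.

```python
import numpy as np

def estimate_rates(trajectory, n_states):
    """Estimate state exit rates and the embedded Markov chain's
    transition matrix from (state, sojourn_time, reward_rate, next_state)
    tuples observed at jump epochs."""
    total_time = np.zeros(n_states)
    counts = np.zeros((n_states, n_states))
    for s, tau, _, s_next in trajectory:
        total_time[s] += tau
        counts[s, s_next] += 1
    visits = counts.sum(axis=1)
    rates = visits / np.maximum(total_time, 1e-12)   # jumps per unit time
    P = counts / np.maximum(visits[:, None], 1.0)    # embedded chain
    return rates, P

def td_potentials(trajectory, n_states, alpha=0.05, beta=0.01):
    """Online TD evaluation of CTMDP potentials from one sample path."""
    g = np.zeros(n_states)   # potential estimates
    eta = 0.0                # running average-reward estimate
    for s, tau, r, s_next in trajectory:
        # TD error over one sojourn: accumulated reward minus the
        # average-reward baseline, plus the change in potential.
        delta = (r - eta) * tau + g[s_next] - g[s]
        g[s] += alpha * delta
        eta += beta * delta
    return g - g[0], eta     # normalize so g[0] = 0
```

With suitably diminishing step sizes, an update of this form would be the online counterpart of the L-step offline evaluation mentioned above.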
Keywords/Search Tags: Continuous-time Markov decision processes, potential-performance, policy iteration algorithm, reinforcement learning, UAV control