| Although traditional search algorithms for UAV trajectory planning have strong path search capability, they cannot obtain prior knowledge from historical experience. Reinforcement learning gains experience through an iterative process of trial and evaluation, and thereby obtains a state-action mapping strategy that maximizes cumulative reward. A method based on reinforcement learning can therefore use the learned strategy as prior knowledge in unknown environments or new tasks, which improves the efficiency of trajectory planning. Deep reinforcement learning uses the strong perception and representation capabilities of deep neural networks to obtain optimized strategies, giving the trajectory planning strategy learning model the ability to generalize to dynamic tasks and to complex, changing environments. This thesis proposes a strategy self-learning method based on deep reinforcement learning for trajectory planning in a multi-constrained complex environment. Taking into account the characteristics of the input information, such as planning tasks, constraints, flight environment, and optimization objectives, the key models of the deep reinforcement learning system are designed: state, action, reward function, and strategy-value deep network. In terms of state and action space design, a layered coding representation of the planning task, the global environment, and the local environment of the aircraft provides a graphical representation of the aircraft’s turning state and matching state. The complex constraints between two matching navigation points are used to construct the feasible interval of the turning point and the feasible region of the next matching navigation point. This reduces the action space, which not only ensures that the trajectory obtained through exploration and decision-making satisfies the complex constraints but also effectively reduces the difficulty of decision-making and speeds up
the trajectory planning. In terms of reward function design, the reward function is built from the optimization objectives of existing traditional trajectory planning systems, and reward shaping is used to introduce heuristic information into it, which improves the learning efficiency of the system. In terms of strategy learning and representation in the deep reinforcement learning process, a turning point planning strategy network and a matching point planning strategy network are designed by combining deep convolutional neural networks with the Actor-Critic method. The planning strategy network learns iteratively in two steps: 1) Monte Carlo tree search, based on the planning strategy network, guides the unmanned aerial vehicle to explore the environment and generate sample data; 2) the planning strategy network learns from the sample data and updates the strategy. Monte Carlo tree search has a strong strategy improvement capability and generates higher-quality trajectory samples, which helps improve the learning efficiency of the planning strategy networks. The experimental results show that the reinforcement learning system designed in this thesis has self-learning ability and can accomplish the trajectory planning task well. The learned planning strategies generalize to unknown environments or new tasks. |
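
The abstract does not say which form of reward shaping is used. As a rough illustration only, the sketch below assumes potential-based shaping, where the term gamma * phi(s') - phi(s) is added to the base reward and the potential phi is the negative straight-line distance to the goal. The names `potential` and `shaped_reward`, the discount factor 0.99, the 2D coordinates, and the step cost of -1 are all hypothetical and are not taken from the thesis.

```python
import math

GAMMA = 0.99  # discount factor (assumed value, not from the thesis)


def potential(position, goal):
    """Heuristic potential: negative Euclidean distance to the goal (assumed form)."""
    return -math.dist(position, goal)


def shaped_reward(base_reward, position, next_position, goal, gamma=GAMMA):
    """Return the base reward plus the shaping term gamma * phi(s') - phi(s)."""
    shaping = gamma * potential(next_position, goal) - potential(position, goal)
    return base_reward + shaping


# Hypothetical 2D example: one unit step toward the goal and one away from it,
# each with the same base step cost of -1.
goal = (10.0, 0.0)
toward = shaped_reward(-1.0, (0.0, 0.0), (1.0, 0.0), goal)
away = shaped_reward(-1.0, (0.0, 0.0), (-1.0, 0.0), goal)
print(toward > away)  # the step toward the goal receives the larger shaped reward
```

Under these assumptions the heuristic only redistributes reward along the trajectory, so steps that approach the goal are favored early in learning without changing the base optimization objective.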