Reinforcement learning (RL) is one of the promising theoretical approaches to the Unmanned Aerial Vehicle (UAV) path planning problem. The agent constructs a reward signal from its own observations, integrates the feedback it receives through repeated interaction with the environment, and, by continuously deepening its understanding of the environment and adjusting its state, ultimately obtains a policy that can plan the optimal path. Although RL algorithms perform well in most path-planning tasks, their Q-value-based action selection is inaccurate in the early stage of training, because the agent's understanding of the environment is still incomplete and relevant information is lacking. Some algorithms attempt to obtain more accurate Q-values by changing how the Q-values are computed, thereby improving the accuracy of action selection; however, this strategy cannot fundamentally solve the agent's lack of environmental awareness during early training. In addition, applying RL requires a known reward function, and in complex environments manually designing and tuning the reward function is a significant challenge. One alternative for the case of an unknown reward function is Inverse Reinforcement Learning (IRL), which extracts information from expert demonstrations to produce a policy whose results are equivalent to the expert's. The performance of IRL depends on the quality of the provided demonstrations. Most IRL algorithms do not distinguish between demonstrations during path planning, so optimal and suboptimal demonstrations influence the final policy equally, which can lead to suboptimal outcomes. Moreover, these algorithms typically require a further round of RL with the inferred reward function to obtain the path, making the problem even more complex. To address these problems, this thesis proposes two more general, robust, and faster-converging solutions. The research content and innovative contributions of this thesis mainly include the following two points:

1. A Path Planning Method Based on Artificial Potential Field (APF) and Deep Q-Network (DQN)

An improved DQN path planning method, B-APFDQN, was developed. It uses the artificial potential field method as a reference for the agent's action selection, effectively alleviating the slow convergence caused by the agent's need for extensive trial and error. The method employs a neural network that simultaneously outputs an action distribution and Q-values, in contrast to the classical DQN structure that outputs only Q-values, and combines it with the APF method to accelerate the agent's training. B-APFDQN also uses an SA-ε-greedy algorithm that lets the agent adjust how often it explores its surroundings according to the search progress, ensuring sufficient exploration of the environment without getting stuck in local optima. After redundant nodes are removed from the resulting path, a B-spline algorithm smooths it into a trajectory that the UAV can easily execute.
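The following minimal Python sketch illustrates the general idea of APF-guided exploration with an annealed exploration rate. It assumes a grid world with discrete actions; the function names (sa_epsilon, select_action, apf_preferred_action), the exponential temperature schedule, and the bias parameter are illustrative assumptions, not the exact B-APFDQN implementation.

```python
import math
import random

import numpy as np


def sa_epsilon(step, eps_max=0.9, eps_min=0.05, temperature=200.0):
    """Simulated-annealing-style schedule: exploration decays as training progresses."""
    return eps_min + (eps_max - eps_min) * math.exp(-step / temperature)


def select_action(q_values, apf_action, step, bias=0.5):
    """Pick a discrete action from Q-values, biased toward the APF-suggested action.

    q_values   : 1-D array of Q-values, one per discrete action
    apf_action : index of the action pointing down the artificial-potential gradient
    bias       : probability of following the APF hint while exploring (assumed value)
    """
    eps = sa_epsilon(step)
    if random.random() < eps:                   # exploration branch
        if random.random() < bias:
            return apf_action                   # follow the potential-field hint
        return random.randrange(len(q_values))  # otherwise explore uniformly at random
    return int(np.argmax(q_values))             # exploitation: greedy on Q-values
```

In this sketch the potential field only biases exploration; the learned Q-values still decide the greedy action, which is consistent with using APF as a reference rather than a replacement for the DQN policy.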
2. A Path Planning Method Based on Improved Maximum Entropy Inverse Reinforcement Learning and Probabilistic Roadmap

A path planning method, DTW-Max Ent-PRM, was proposed based on improved maximum entropy inverse reinforcement learning and the probabilistic roadmap (PRM). The dynamic time warping (DTW) algorithm is used to distinguish the roles of different demonstrations when inferring the reward function, improving the Maximum Entropy IRL (Max Ent IRL) method. This addresses the problem that, when expert demonstrations are not distinguished, optimal and suboptimal demonstrations contribute equally to the reward-function inference and the result can be non-optimal. After the reward function is inferred, the task space is sampled according to the reward function and the expert demonstrations, and a PRM is then used to re-generate trajectories instead of running reinforcement learning on the inferred reward function, avoiding the complex RL stage that IRL usually requires after reward inference. Through iterative reconstruction of the expert demonstrations and re-sampling of the task space, the region of the task space containing the optimal path is gradually narrowed.

The main difference between B-APFDQN and DTW-Max Ent-PRM is that the former accelerates the search by directly intervening in action selection through an expert method when the reward function is known, whereas the latter extracts information from expert demonstrations to complete the search when the reward function is unknown. Their most important similarity is that both use expert knowledge to guide the agent's path search in the task space as far as possible. Experiments were conducted in different grid environments and compared against classical and improved path-planning algorithms. The results show that B-APFDQN and DTW-Max Ent-PRM obtain shorter paths and exhibit higher robustness and generality.
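The sketch below illustrates the demonstration-weighting idea in Python: DTW distances to a reference trajectory are turned into weights on each demonstration's feature counts before they enter the reward inference. The grid-world state features, the exponential weighting kernel, the choice of reference trajectory, and the names dtw_distance and weighted_feature_expectations are all illustrative assumptions, not the exact DTW-Max Ent-PRM formulation.

```python
import numpy as np


def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two trajectories of 2-D points."""
    n, m = len(a), len(b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.asarray(a[i - 1]) - np.asarray(b[j - 1]))
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m]


def weighted_feature_expectations(demos, features, reference):
    """Weight each demonstration's feature counts by its DTW similarity to a reference path.

    demos     : list of state trajectories (each a list of grid states)
    features  : dict mapping state -> feature vector (numpy array)
    reference : trajectory used as the similarity anchor, e.g. the shortest demonstration
    """
    dists = np.array([dtw_distance(t, reference) for t in demos])
    # Closer to the reference -> larger weight (assumed exponential kernel)
    weights = np.exp(-(dists - dists.min()) / (dists.std() + 1e-8))
    weights /= weights.sum()
    mu = sum(w * sum(features[s] for s in traj) for w, traj in zip(weights, demos))
    return mu
```

In Max Ent IRL, this weighted expectation would stand in for the usual unweighted average of demonstration feature counts in the gradient of the reward parameters, so that suboptimal demonstrations contribute less to the inferred reward.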