| Aircraft carrier operations are a key part of modern maritime military operations, and the decisive factor in carrier operations is safe and efficient scheduling of carrier-based aircraft formations. With the development of military science and technology, manual decision-making assisted by traditional heuristic algorithms has become the most widely used method for combat scheduling of carrier-based aircraft formations. However, the carrier's combat environment is high-risk and changeable, its deck space is narrow and crowded with equipment, and carrier-based aircraft must complete support, dispatch, and recovery tasks in a complex deck space subject to dynamic uncertainty, which makes already complicated carrier operations even more difficult. At the same time, because of the particular nature of the combat mission, the scheduling algorithm must support continuous dispatch and recovery of carrier aircraft and make online decisions when unexpected situations arise. Heuristic intelligent algorithms, which compute decisions for tasks in batches, limit the continuous dispatch capability and online combat capability of the carrier aircraft. In view of the difficulties traditional scheduling algorithms face in achieving multi-objective online scheduling in a high-risk and variable environment, a deep reinforcement learning algorithm based on the Markov Decision Process is proposed to solve multi-objective online scheduling for the continuous dispatch and recovery of large-scale carrier aircraft formations. The main contributions of this article are as follows. (1) For the problem of multi-objective online scheduling in the continuous dispatch and recovery of carrier-based aircraft, the scheduling decision objectives are defined as reducing aircraft displacement on the deck surface, reducing the number of encounters between aircraft, balancing the equipment utilization rate, and stabilizing the scheduling cycle. According to the 
Markov Decision Process, an online real-time scheduling decision model is constructed that takes the status of carrier-based aircraft and equipment as input and the value function of scheduling actions as output. An Action-Mask mechanism is designed to improve the efficiency of action selection, and the reward is set as a weighted feature vector, which reduces the multi-objective problem to a single-objective one in a way that matches the actual problem. In online scheduling experiments on continuous dispatch and recovery of carrier-based aircraft with emergency conditions, this model makes effective scheduling decisions. (2) To address the dynamic uncertainty in the dispatch and recovery scheduling problem of carrier-based aircraft, deep reinforcement learning is used to optimize the scheduling decision. The dynamic uncertainty is measured and included as part of the state, and a state transition is performed under the Markov Decision Process at each decision point. To avoid overestimation, this paper uses the Double Q-learning algorithm, which constructs two neural networks, one for action selection and one for action evaluation. Actions are selected with a variable ε-greedy strategy. In addition, Batch Normalization layers are added to the neural network to normalize the input data in batches, an adaptive activation function is used, and gradient clipping is applied during backpropagation to avoid gradient explosion during training. The scheduling strategy obtained by the optimized deep reinforcement learning achieves multi-objective optimization and shows clear advantages over heuristic algorithms and scheduling rules. (3) Because the problem studied in this paper is set in a partially observable environment, and in order to obtain more information from the environment for network training and a more comprehensive and accurate decision model, this paper uses a Deep Recurrent Q-learning algorithm to train the model on time sequences. At the same 
time, an attention mechanism and a prioritized experience replay mechanism are added to the recurrent neural network, which accelerates convergence, helps explore better strategies, and makes the resulting strategies more stable. |
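The ideas named in the abstract (a weighted reward, an Action-Mask, a variable ε-greedy strategy, and a Double Q-learning target) can be illustrated with a minimal sketch. This is not the authors' implementation: every function name, the linear ε decay schedule, and all default values below are assumptions chosen only for illustration, and plain Python lists stand in for neural network outputs.

```python
import random


def scalarize_reward(features, weights):
    """Collapse a multi-objective feature vector into one scalar reward."""
    return sum(w * f for w, f in zip(weights, features))


def masked_argmax(q_values, mask):
    """Index of the largest Q-value among actions whose mask entry is True."""
    legal = [i for i, ok in enumerate(mask) if ok]
    if not legal:
        raise ValueError("no legal action available")
    return max(legal, key=lambda i: q_values[i])


def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=10_000):
    """Assumed linear schedule: epsilon shrinks from start to end, then stays."""
    frac = min(step / decay_steps, 1.0)
    return end + (1.0 - frac) * (start - end)


def epsilon_greedy(q_values, mask, epsilon, rng=random):
    """Random legal action with probability epsilon, else the greedy legal one."""
    legal = [i for i, ok in enumerate(mask) if ok]
    if not legal:
        raise ValueError("no legal action available")
    if rng.random() < epsilon:
        return rng.choice(legal)
    return masked_argmax(q_values, mask)


def double_q_target(reward, next_q_online, next_q_target, next_mask,
                    gamma=0.99, done=False):
    """Double Q-learning target: the online network selects the next action,
    the target network evaluates it, which limits overestimation."""
    if done:
        return reward
    best = masked_argmax(next_q_online, next_mask)
    return reward + gamma * next_q_target[best]


if __name__ == "__main__":
    # Hypothetical features: displacement, encounters, utilization balance, cycle stability.
    r = scalarize_reward([1.0, 2.0, 0.5, 0.0], [-1.0, -2.0, 1.0, 0.0])
    mask = [False, True, True]  # action 0 is currently infeasible
    action = epsilon_greedy([5.0, 1.0, 3.0], mask, decayed_epsilon(step=500))
    target = double_q_target(r, [5.0, 1.0, 3.0], [0.0, 10.0, 2.0], mask)
    print(action, target)
```

Masking before the argmax keeps infeasible scheduling actions out of both action selection and the bootstrap target, so the agent never learns from actions it could not have taken.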