With the rapid development of technology, multi-agent autonomous decision-making methods based on deep reinforcement learning have been applied to many unmanned autonomous control systems, such as traffic flow regulation, energy distribution, robot control, game AI design, and military command decision-making. Using computers and autonomous intelligent decision-making algorithms for wargame deduction, mining the potential tactics of wargame AI in a real-time battlefield environment, and broadening the combat thinking of commanders are both requirements for the development of modern wargame deduction and keys to exploring and applying advanced artificial intelligence technology. Improving the real-time performance of deep reinforcement learning algorithms is therefore essential for enabling wargame AI to make intelligent real-time decisions based on the environment situation, for accelerating the deduction of the confrontation situation, and for reducing the restrictions on applying multi-agent methods in the wargame deduction environment. Building on autonomously learning deep reinforcement learning algorithms, this thesis studies the data usage efficiency of multi-agent real-time autonomous decision-making methods and the problem of sparse immediate rewards in the wargame environment. The main work is as follows:

(1) According to the characteristics of the wargame deduction environment, this thesis designs the state-space and action-space features of the wargame AI. Based on the real-time adversarial wargame environment, the key characteristic information of the confrontation situation between enemy and friendly forces is generated from the raw data provided by the environment engine. In addition, the original action space provided by the wargame environment is optimized, which simplifies the interaction between the agents and the environment and speeds up the learning of the policy networks (a feature-construction sketch is given below).

(2) A multi-agent autonomous decision-making method based on prioritized experience, Priority Trajectory Multi-Agent Policy Gradient (PTMAPG), is proposed. Built on the multi-agent actor-critic framework, it improves how the agents use experience data to explore the environment: through a prioritized experience replay method based on the TD-N error and a sum tree, each agent preferentially explores the environment and exploits historical experience data, which effectively improves the data usage efficiency of the multi-agent decision-making model (a replay-buffer sketch is given below).

(3) A Multi-Agent Proximal Policy Optimization algorithm based on an Adaptive Intrinsic Reward function (AIR-MAPPO) is proposed. Built on the multi-agent proximal policy optimization algorithm, it adjusts the distribution of immediate rewards through an intrinsic reward function derived from environmental factors, reducing reward sparsity (a reward-shaping sketch is given below). A cognitive hybrid network module further strengthens the critic network's understanding of the local environment state and its feature-fusion ability, improving the training speed and decision-making capability of the model.

Finally, the proposed PTMAPG and AIR-MAPPO algorithms are tested in the mountain 3v3 and water network 3v3 wargame scenarios, and their effectiveness in real-time decision-making environments is verified.
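The abstract does not give the concrete feature layout for task (1), so the following Python sketch only illustrates the general idea under stated assumptions: per-unit attributes from the engine's raw data are normalized and concatenated into a fixed-length state vector, and an action mask narrows the raw action space. The field names (x, y, hp, ammo, side) and the masking rule are hypothetical, not taken from the thesis.

```python
import numpy as np

def unit_features(unit, map_size):
    # Hypothetical per-unit slice: normalized position, health,
    # ammunition, and a friend/foe flag from the engine's raw data.
    return np.array([
        unit["x"] / map_size[0],
        unit["y"] / map_size[1],
        unit["hp"] / 100.0,
        unit["ammo"] / max(unit["ammo_max"], 1),
        1.0 if unit["side"] == "friendly" else -1.0,
    ], dtype=np.float32)

def build_state(friendly, enemy, map_size):
    # Concatenate friendly and observed enemy slices into one
    # fixed-length vector for the policy-network input.
    rows = [unit_features(u, map_size) for u in friendly + enemy]
    return np.concatenate(rows)

def action_mask(n_actions, can_fire):
    # Simplified action space: mark invalid actions so the policy
    # never samples them, shortening agent-environment interaction.
    mask = np.ones(n_actions, dtype=bool)
    if not can_fire:
        mask[-1] = False  # assume the last action is "fire"
    return mask
```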
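For task (2), the thesis names TD-N priorities and a sum tree, but the abstract gives no implementation. The sketch below is a minimal Python version under the assumption that "TD-N" denotes the N-step TD error used as the priority signal; the names SumTree and n_step_td_error and the sampling interface are illustrative, not the thesis's actual code.

```python
import numpy as np

class SumTree:
    # Binary sum tree: leaves hold priorities, internal nodes hold sums,
    # so sampling proportional to priority costs O(log n).
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)  # internal nodes + leaves
        self.data = [None] * capacity           # stored trajectories
        self.write = 0                          # next leaf to overwrite

    def add(self, priority, item):
        leaf = self.write + self.capacity - 1
        self.data[self.write] = item
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                        # propagate the change upward
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def sample(self, value):
        # Descend from the root; 'value' is uniform in [0, total priority).
        idx = 0
        while 2 * idx + 1 < len(self.tree):     # until a leaf is reached
            left = 2 * idx + 1
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]

def n_step_td_error(rewards, v_start, v_end, gamma=0.99):
    # TD-N priority: |r_t + gamma*r_{t+1} + ... + gamma^N * V(s_{t+N}) - V(s_t)|
    ret = v_end
    for r in reversed(rewards):
        ret = r + gamma * ret
    return abs(ret - v_start)

# Usage: priorities drive proportional sampling from the buffer.
tree = SumTree(capacity=1024)
p = n_step_td_error([0.0, 0.0, 1.0], v_start=0.2, v_end=0.5)
tree.add(p, ("trajectory", "payload"))
idx, priority, item = tree.sample(np.random.uniform(0.0, tree.tree[0]))
```

Sampling proportional to priority makes trajectories with large TD-N errors more likely to be replayed, which is one way to realize the "preference" each agent is described as having over historical experience.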
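For task (3), the abstract describes an adaptive intrinsic reward based on environmental factors without giving its form. The sketch below shows one common shaping scheme consistent with that description: an intrinsic progress signal (here, assumed to be distance closed toward an objective) is added to the sparse extrinsic reward with a weight that anneals over training, so shaping fades out as learning proceeds. The environmental signal and the decay rule are assumptions for illustration only.

```python
import numpy as np

def intrinsic_reward(agent_pos, objective_pos, prev_dist):
    # Hypothetical environmental-factor signal: reward the distance
    # closed toward an objective since the previous step.
    dist = float(np.linalg.norm(np.asarray(agent_pos) - np.asarray(objective_pos)))
    return prev_dist - dist, dist

class AdaptiveIntrinsicScale:
    # Anneals the intrinsic weight so shaping fades out and the policy
    # ends up optimizing the true (extrinsic) wargame reward.
    def __init__(self, beta0=0.5, decay=0.999, beta_min=0.0):
        self.beta = beta0
        self.decay = decay
        self.beta_min = beta_min

    def step(self):
        self.beta = max(self.beta * self.decay, self.beta_min)
        return self.beta

# Combined reward fed to the MAPPO advantage estimator:
#   r_total = r_ext + beta_t * r_int
scale = AdaptiveIntrinsicScale()
prev_dist = 10.0
r_int, prev_dist = intrinsic_reward((2.0, 3.0), (0.0, 0.0), prev_dist)
r_ext = 0.0                      # sparse immediate reward from the engine
r_total = r_ext + scale.step() * r_int
```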