The inherently non-stationary environment in multi-agent tasks has a great impact on the learning of agent policies. To alleviate this non-stationarity, researchers use opponent modeling to enrich the agent's decision-making information by modeling the relevant attributes of other agents. However, existing opponent modeling methods either require the agent to have a detailed understanding of the environment or assume that the agent can access the information of other agents. Requiring global observation of the environment severely limits the applicable scenarios, and accessing the information of other agents is impractical in the real world. Therefore, this paper considers partially observable environments and uses opponent modeling to alleviate the non-stationarity of the environment. The core research contents of this paper are as follows:

(1) This paper proposes the Opponent Modeling with Limited Cognition (POLO) algorithm to alleviate the non-stationarity of the environment. Motivated by the idea that behavior is driven by intention, the algorithm first uses an attention mechanism to explicitly model the intentions of other agents, and then infers the policies of other agents from these intentions. The inferred policies are incorporated into the agent's own learning process to cope with the non-stationarity of the environment and improve learning efficiency. Experiments are carried out on three multi-agent tasks: cooperative navigation, target-consistent navigation, and emergency rescue of injured animals. The results show that POLO, which infers policies from intentions, effectively improves convergence speed and achieves higher reward.

(2) Building on POLO, this paper proposes the Opponent Modeling with Limited Cognition by Long-Term Memory (POLO-LTM) algorithm to further stabilize the learning of the value network. In reinforcement learning, the reward guides the optimization of the agent's policy, and the quality of the value network directly affects how well the policy network learns. POLO-LTM therefore enriches the information fed to the agent's value network so that the agent learns a better value function to guide policy learning. In addition, inspired by the way humans reuse past experience, a long short-term memory (LSTM) network is introduced into the agent's network, allowing the agent to reuse historical information to augment its local information, further alleviating the non-stationarity of the environment and improving the agent's learning speed. The effectiveness of POLO-LTM is verified in the cooperative navigation and target-consistent navigation environments. The results show that both the convergence speed and the final reward of POLO-LTM are significantly better than those of the baseline algorithms.
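
The following is a minimal, illustrative sketch (not the thesis implementation) of the intention-modeling idea in contribution (1): an attention module attends over the agent's local observations of other agents to produce an intention embedding, and a small head infers an opponent's action distribution from that embedding. All class names, dimensions, and layer sizes (IntentionEncoder, OpponentPolicyHead, obs_dim, embed_dim) are hypothetical assumptions made for illustration.

# Illustrative sketch only; module names and dimensions are hypothetical,
# not taken from the POLO implementation.
import torch
import torch.nn as nn


class IntentionEncoder(nn.Module):
    """Attends over per-opponent observation features to form an intention embedding."""

    def __init__(self, obs_dim: int, embed_dim: int = 64, n_heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(obs_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

    def forward(self, own_obs: torch.Tensor, opp_obs: torch.Tensor) -> torch.Tensor:
        # own_obs: (batch, obs_dim); opp_obs: (batch, n_opponents, obs_dim)
        query = self.proj(own_obs).unsqueeze(1)        # (batch, 1, embed_dim)
        keys = values = self.proj(opp_obs)             # (batch, n_opponents, embed_dim)
        intention, _ = self.attn(query, keys, values)  # attend over observed opponents
        return intention.squeeze(1)                    # (batch, embed_dim)


class OpponentPolicyHead(nn.Module):
    """Infers a distribution over an opponent's actions from the intention embedding."""

    def __init__(self, embed_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, intention: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(intention), dim=-1)

In a sketch like this, the inferred opponent action distribution would be fed to the agent's own policy or value network as additional input; how the thesis actually combines it with the agent's learning process is not specified here.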
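
Likewise, a minimal sketch of the memory-augmented value network described in contribution (2), assuming a PyTorch-style LSTM over the agent's local observation history; the class name RecurrentValueNetwork and all dimensions are hypothetical and do not reflect the POLO-LTM implementation.

# Illustrative sketch only; names and dimensions are hypothetical.
import torch
import torch.nn as nn


class RecurrentValueNetwork(nn.Module):
    """Value network with an LSTM so the value estimate can reuse historical local information."""

    def __init__(self, input_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, obs_history: torch.Tensor, hidden=None):
        # obs_history: (batch, time, input_dim); hidden carries memory across steps
        out, hidden = self.lstm(obs_history, hidden)
        value = self.value_head(out[:, -1])  # value estimate from the latest time step
        return value, hidden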