Multi-agent reinforcement learning has attracted considerable attention in recent years, and value decomposition has become one of its most actively studied problems. In value decomposition methods, the joint value function of the environment is expressed as a combination of each agent's individual value function in order to improve the performance of the joint policy. However, current value decomposition methods suffer from two main problems: (1) their learning efficiency is low, and since learning efficiency is an important performance indicator of an algorithm, improving it is of great significance; (2) they explore insufficiently, and since exploration ability is crucial for multi-agent reinforcement learning, improving it helps the agents' policies avoid local optima and yields better-performing joint policies. To address these problems, this paper proposes the following solutions.

(1) This paper proposes WF-QMIX (Weighted Feedback-QMIX), an accelerated-convergence mechanism based on importance-weighted feedback. The algorithm improves the learning efficiency of value decomposition by introducing an additional set of action-value functions. First, an importance-weight parameter network assigns a set of importance weights to the agents' action values. Second, a selection gate structure is introduced: when the weighted action-value combination, passed through the mixing network, yields a joint value closer to the target value, the algorithm reduces the difference between the original action-value combination and the weighted one and updates the model accordingly, accelerating convergence; otherwise, it enlarges the difference between the two combinations to improve exploration (see the first sketch below). Experimental results show that WF-QMIX outperforms the comparison algorithms in both convergence speed and final performance.

(2) This paper proposes WFVAE (Weighted-Feedback QMIX with Variational Exploration), which extends the exploration mechanism with variational exploration. The algorithm addresses the insufficient exploration of value decomposition methods by introducing a latent behavior-mode variable that adjusts the policies with which agents interact with the environment. First, the latent behavior-mode variable is introduced and associated with the agents' policies. Second, by varying this latent variable, the algorithm dynamically adjusts the agents' interaction policies, enlarging the exploration space and further improving exploration ability (see the second sketch below). Experimental results show that WFVAE outperforms the comparison algorithms.
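The following is a minimal sketch of one possible reading of the WF-QMIX weighted-feedback gate described in contribution (1), not the paper's actual implementation. The component names (`ImportanceWeightNet`, `SimpleMixer`, `weighted_feedback_loss`), network sizes, and the exact form of the gate and loss terms are all assumptions; only the overall idea (weight the agents' action values, mix both combinations, and pull them together or push them apart depending on which is closer to the target) comes from the abstract.

```python
import torch
import torch.nn as nn


class ImportanceWeightNet(nn.Module):
    """Hypothetical network mapping the global state to one importance weight per agent."""
    def __init__(self, state_dim, n_agents):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_agents), nn.Softplus())

    def forward(self, state):
        return self.net(state)  # (batch, n_agents), non-negative weights


class SimpleMixer(nn.Module):
    """Placeholder mixing network (the real QMIX mixer uses state-conditioned
    hypernetworks with non-negative weights to keep the mixing monotonic)."""
    def __init__(self, n_agents, state_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_agents + state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, agent_qs, state):
        return self.net(torch.cat([agent_qs, state], dim=-1))  # (batch, 1)


def weighted_feedback_loss(mixer, weight_net, agent_qs, state, target, beta=1.0):
    """One interpretation of the WF-QMIX selection gate.

    agent_qs: (batch, n_agents) chosen action values from the agent networks
    target:   (batch, 1) bootstrapped target joint value
    """
    weights = weight_net(state)                 # importance weights for each agent
    weighted_qs = weights * agent_qs            # re-weighted action-value combination

    q_tot_orig = mixer(agent_qs, state)         # joint value of the original combination
    q_tot_weighted = mixer(weighted_qs, state)  # joint value of the weighted combination

    td_orig = (q_tot_orig - target).abs()
    td_weighted = (q_tot_weighted - target).abs()

    # Gate: if the weighted combination is closer to the target, pull the two
    # combinations together (exploit the feedback); otherwise push them apart
    # to encourage exploration.
    gate = (td_weighted < td_orig).float()
    gap = (agent_qs - weighted_qs).pow(2).mean(dim=-1, keepdim=True)
    feedback_loss = (gate * gap - (1.0 - gate) * gap).mean()

    td_loss = (q_tot_orig - target).pow(2).mean()
    return td_loss + beta * feedback_loss


# Toy usage with random data (dimensions are illustrative only).
batch, n_agents, state_dim = 32, 3, 10
mixer = SimpleMixer(n_agents, state_dim)
weight_net = ImportanceWeightNet(state_dim, n_agents)
loss = weighted_feedback_loss(mixer, weight_net,
                              torch.randn(batch, n_agents),
                              torch.randn(batch, state_dim),
                              torch.randn(batch, 1))
loss.backward()
```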
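The second sketch illustrates the latent behavior-mode idea behind WFVAE in contribution (2) under stated assumptions: a variational encoder produces a latent variable z, the agent's utility network is conditioned on z, and resampling z changes the interaction policy. The encoder input, network shapes, and the names `BehaviorModeEncoder` and `LatentConditionedAgent` are hypothetical; the abstract does not specify how the latent variable is inferred or regularized.

```python
import torch
import torch.nn as nn


class BehaviorModeEncoder(nn.Module):
    """Variational encoder producing a latent behavior-mode variable z from a
    trajectory summary (hypothetical input choice)."""
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.mu = nn.Linear(input_dim, latent_dim)
        self.log_std = nn.Linear(input_dim, latent_dim)

    def forward(self, traj_summary):
        mu, log_std = self.mu(traj_summary), self.log_std(traj_summary)
        z = mu + log_std.exp() * torch.randn_like(mu)  # reparameterized sample
        return z, mu, log_std


class LatentConditionedAgent(nn.Module):
    """Agent utility network conditioned on z: different samples of z induce
    different interaction policies, widening the exploration space."""
    def __init__(self, obs_dim, latent_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + latent_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs, z):
        return self.net(torch.cat([obs, z], dim=-1))  # per-action values


def kl_to_standard_normal(mu, log_std):
    """KL(q(z) || N(0, I)) for a diagonal Gaussian posterior, used as a
    VAE-style regularizer on the behavior-mode latent."""
    return 0.5 * (mu.pow(2) + (2 * log_std).exp() - 2 * log_std - 1).sum(dim=-1).mean()


# Toy usage: resampling z changes the behavior mode, hence the interaction policy.
obs_dim, summary_dim, latent_dim, n_actions = 8, 16, 4, 5
encoder = BehaviorModeEncoder(summary_dim, latent_dim)
agent = LatentConditionedAgent(obs_dim, latent_dim, n_actions)
z, mu, log_std = encoder(torch.randn(1, summary_dim))
q_values = agent(torch.randn(1, obs_dim), z)   # action values under this behavior mode
action = q_values.argmax(dim=-1)               # greedy action for this mode
reg = kl_to_standard_normal(mu, log_std)       # keeps the latent space regular
```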