With the development of deep learning and Internet of Things technology, the concept of everything being interconnected has given rise to automated control scenarios in multi-agent environments. Multi-agent reinforcement learning algorithms have flourished in such environments and are widely used in fields such as logistics and transportation, automation, robot control, and simulation. Current multi-agent reinforcement learning algorithms based on Q-value decomposition focus on the relationship between the joint action-value function and the local action values, and many algorithms with excellent performance have been derived from this idea. However, the various Q-value decomposition methods use only part of the information in the agents' local observations and do not make full use of the environmental state information. To improve the efficiency with which environmental information and agent action information are used, this thesis proposes multi-agent reinforcement learning algorithms based on weighted Q-value decomposition. The specific research content is as follows:

(1) The weight function of the traditional Weighted QMIX (WQMIX) algorithm is simple, and its way of weighting joint actions lacks diversity, so the effect of some actions on the loss function is lost. To address this problem, a Mixed Weighted QMIX (MWQMIX) algorithm that weights the joint actions with a mixed weight function is proposed. Through the new weight function, MWQMIX divides the joint actions into four cases at a finer granularity and can therefore make fuller use of the action information in the joint action than the traditional WQMIX algorithm. Through derivation and proof, this thesis shows that, under the four conditions of this division, the joint action transforms the loss function into four expressions with different forms and value ranges. Under the weight values attached to the different conditions, the value ranges of the four loss expressions are mutually disjoint, so each case contributes distinctly to backpropagation, and the action information contained in the joint action is used effectively. Experimental results on two simple environments, one complex environment, and one extremely difficult environment of the StarCraft Multi-Agent Challenge (SMAC) platform show that MWQMIX converges faster than other Q-value decomposition algorithms. In the more complex environments, the behavior of the agents controlled by MWQMIX is more stable, and the agents escape locally optimal states faster, which prevents them from remaining in long-term blind exploration after falling into a local optimum.
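The abstract does not give the exact form of the mixed weight function, so the following is only a minimal sketch of how a condition-dependent weighting of the joint-action TD error could look. The four-way case split (greedy vs. non-greedy joint action, underestimated vs. overestimated value), the weight values, and the function names are assumptions introduced here for illustration, not the thesis's actual definitions.

```python
import torch

def mixed_weights(q_joint, target, is_greedy, w=(1.0, 0.75, 0.5, 0.25)):
    """Assign one of four weights to each transition, depending on whether the
    executed joint action is the current greedy joint action and whether the
    TD target exceeds the current joint Q estimate (illustrative conditions)."""
    underestimated = target > q_joint
    case1 = is_greedy & underestimated      # greedy action, value too low
    case2 = is_greedy & ~underestimated     # greedy action, value too high
    case3 = ~is_greedy & underestimated     # non-greedy action, value too low
    case4 = ~is_greedy & ~underestimated    # non-greedy action, value too high
    return w[0] * case1 + w[1] * case2 + w[2] * case3 + w[3] * case4

def weighted_td_loss(q_joint, target, is_greedy):
    """Weighted loss: each squared TD error is scaled by its case weight."""
    weights = mixed_weights(q_joint, target, is_greedy)
    return (weights * (q_joint - target.detach()) ** 2).mean()

# Toy batch of four transitions, one per case.
q_joint = torch.tensor([0.2, 0.9, -0.1, 0.4])
target  = torch.tensor([0.5, 0.3,  0.0, 0.4])
greedy  = torch.tensor([True, True, False, False])
print(weighted_td_loss(q_joint, target, greedy))
```

Because the four weights differ, transitions falling into different cases contribute to the gradient with different magnitudes, which is the mechanism such a weighting scheme relies on.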
(2) Q-value decomposition algorithms make little use of the state information contained in the environment, and their use of environmental information is poorly interpretable. To address these problems, a multi-agent reinforcement learning algorithm based on a nonlinear-decay-probability action selection strategy is proposed. To balance the agents' probability of exploring the environment against the probability of exploiting known states, the agents select actions according to a random strategy controlled by a nonlinearly decaying probability. Under this strategy, the agents maintain a high probability of exploring the environment state in the early stage of training, when the environment information has not yet been fully grasped; once the environment state information has been sufficiently learned, the strategy decays the random action selection probability to close to 0. The agents can then stably obtain rewards from the environment using the known information, while still exploring with a probability close to 0 to avoid falling into local optima. Experimental results in four SMAC environments of increasing difficulty show that, under the random strategy controlled by the nonlinearly decaying probability, the agents explore the environment fully, obtain higher cumulative rewards, and escape locally optimal states faster. In addition, to enable the agents to build cooperative relationships from environmental information early in training, the algorithm also lets the agents select actions according to an adjacent action selection strategy based on the gap-junction principle. Using the positional information between agents, each agent imitates the behavior of a neighboring agent with a probability inversely proportional to the distance between them, so that agents that are close to each other quickly establish a cooperative relationship. Experimental results with two groups of six agents in different SMAC environments show that the adjacent action selection strategy allows the algorithm to explore the environment better and converge faster, while preventing the agents from being trapped in a local optimum for a long time.
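The abstract does not specify the decay function itself, so the following is only a minimal sketch of a nonlinearly decaying exploration probability driving random (epsilon-greedy-style) action selection; the exponential schedule and all constants are assumptions chosen for illustration.

```python
import math
import random

def exploration_prob(step, p_start=1.0, p_end=0.01, decay_steps=50_000):
    """Nonlinearly (here: exponentially) decaying exploration probability."""
    return p_end + (p_start - p_end) * math.exp(-step / decay_steps)

def select_action(q_values, step):
    """Explore with the decaying probability, otherwise act greedily."""
    n_actions = len(q_values)
    if random.random() < exploration_prob(step):
        return random.randrange(n_actions)                    # explore
    return max(range(n_actions), key=lambda a: q_values[a])   # exploit

# Early in training exploration dominates; late in training the probability
# has decayed close to p_end but never reaches exactly zero.
print(exploration_prob(0), exploration_prob(500_000))
```

Keeping the floor p_end slightly above zero matches the idea of continuing to explore with a probability close to 0 so that the agents do not settle permanently into a local optimum.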
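The adjacent action selection strategy is likewise described only at a high level. The sketch below assumes, purely for illustration, that each agent can observe the positions and most recent actions of the other agents; it then imitates its nearest neighbour with a probability inversely proportional to their distance (with an assumed cut-off radius). This is one possible realization of the imitation behaviour described above, not the thesis's actual rule.

```python
import math
import random

def imitation_prob(distance, radius=5.0):
    """Imitation probability, inversely proportional to distance and zero
    beyond an assumed cut-off radius (both choices are illustrative)."""
    if distance <= 0.0 or distance > radius:
        return 0.0
    return min(1.0, 1.0 / distance)

def adjacent_action(agent_pos, own_greedy_action, neighbours):
    """neighbours: list of (position, last_action) pairs for the other agents.
    With a distance-dependent probability the agent copies the last action of
    its nearest neighbour; otherwise it keeps its own greedy action."""
    if not neighbours:
        return own_greedy_action
    pos, last_action = min(neighbours, key=lambda n: math.dist(agent_pos, n[0]))
    if random.random() < imitation_prob(math.dist(agent_pos, pos)):
        return last_action          # imitate the nearby agent
    return own_greedy_action

# Toy usage: the nearest neighbour is two units away, so its last action (7)
# is imitated with probability 0.5 in this illustrative scheme.
print(adjacent_action((0.0, 0.0), 3, [((2.0, 0.0), 7), ((4.0, 0.0), 2)]))
```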