As an important branch of machine learning, reinforcement learning adjusts the behavior of a controlled agent through interaction with the environment so as to maximize cumulative reward and thereby obtain an optimal policy. By combining the decision-making ability of reinforcement learning with the perception ability of deep learning, its applications can be extended to environments with higher-dimensional state and action spaces, and notable successes have been achieved in games, robotics, and autonomous driving. As the field has expanded, multi-agent reinforcement learning has developed into an important research branch, with multi-agent collaboration as a central problem. At present, mainstream multi-agent reinforcement learning algorithms adopt the "centralized training, decentralized execution" framework. Although this framework effectively alleviates the non-stationarity problem arising in multi-agent environments, in multi-agent collaboration scenarios it still lacks an effective means of capturing the policy intentions of other agents. Theory of mind, a concept from cognitive psychology, refers to the ability to understand the thoughts and intentions of others, and is pervasive in the collaborative activities of human society.

This thesis combines theory of mind with multi-agent reinforcement learning and proposes new algorithms to address the problems that existing algorithms face in multi-agent cooperation and human-machine cooperation scenarios, respectively. In the multi-agent collaboration scenario, this thesis models the mental model with a neural network, integrates the mental model network into the mainstream multi-agent reinforcement learning framework, and proposes a theory-of-mind-based multi-agent proximal policy optimization algorithm (ToM-MAPPO), which enables the multi-agent reinforcement learning algorithm to explicitly learn the policy intentions of other agents and thereby improves the final collaboration. In the human-machine collaboration scenario, considering the diversity of human models, this thesis models the mental model via inverse reinforcement learning, so that the multi-agent algorithm learns to perceive the reward information of the human model it is paired with, optimizes the design of its own reward function, and ultimately improves the effectiveness of collaboration with human models. Based on this framework, we propose two algorithms built on theory-of-mind reward correction: the reward theory-of-mind multi-agent proximal policy optimization algorithm (ReToM-MAPPO) and the reward theory-of-mind implicit Q-learning algorithm (ReToM-IQL).

To verify the performance of the proposed algorithms, we also designed and implemented a multi-agent collaborative training platform based on the extensible simulation platform XSim Studio (abbreviated as XSim), extending its multi-agent training and human-machine collaborative training functionality. Experiments on this platform in the UAV cooperative multi-target coverage and human-machine cooperative multi-target coverage scenarios show that the performance of ToM-MAPPO, ReToM-MAPPO, and ReToM-IQL is better than that of other existing mainstream multi-agent reinforcement learning algorithms.