| Car ownership in China has been increasing rapidly, leading to increasingly severe problems such as declining road traffic efficiency, frequent traffic accidents, and energy waste. With the development of V2V communication and autonomous driving, Cooperative Adaptive Cruise Control (CACC) and platooning have been shown to mitigate these problems. Compared with driving vehicles individually, controlling and managing a fleet of connected intelligent vehicles that share a route can improve driving safety, among other benefits. Deep reinforcement learning (DRL) is increasingly used in the control field because it handles nonlinear, high-dimensional state problems, allows policy networks to keep learning continually, and does not require a model of the low-level system dynamics. This paper adopts Multi-Agent Reinforcement Learning (MARL) as the longitudinal control strategy for vehicle platoons. The main contributions are as follows: (1) A real-time co-simulation framework for vehicle platoons is built using SUMO and PLEXE. The framework provides a realistic training environment for vehicles and is used to train longitudinal control strategies with various DRL methods under optimal communication conditions. The stability of the vehicle platoon is then analyzed and validated under different operating conditions. (2) A PS-MACDDPG control method is proposed for platoons composed of homogeneous vehicles. The method builds on the MADDPG algorithm, in which the critic network evaluates the states and actions of all agents to address the nonstationarity of the environment. Each agent is trained with a global reward to encourage cooperation. Finally, sharing the parameters of the agents' policy networks improves the scalability of the platoon and mitigates the slow convergence of concurrent multi-agent training. (3) During training, a gradient training strategy that is based 
on a normal distribution is proposed to control the acceleration of the lead vehicle. The acceleration of the lead vehicle is randomly sampled from a normal distribution according to its speed, so as to simulate human driving behavior. At the same time, a proportional factor is increased with the number of training steps, so that the training scenarios progress gradually from easy to hard, which improves the training performance of the model. Simulation results show that the proposed strategy outperforms a random training strategy based on a uniform distribution. (4) The policy network is updated with a model optimization method based on historical rewards. The learning rate is adjusted automatically by monitoring the average of each agent's historical rewards over the most recent n steps. When the monitored performance index does not increase for several consecutive steps, the learning rate is lowered to avoid training failure. Simulation results suggest that the proposed method is more effective than both a fixed learning rate and a method based on historical Critic scores. |
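
The structure described in item (2) can be illustrated with a minimal sketch: a single policy object is reused by every follower (parameter sharing), while the critic receives the observations and actions of all agents. This is not the PS-MACDDPG implementation; the linear-tanh policy, the observation layout, and the names `SharedActor` and `centralized_critic_input` are assumptions made for illustration only.

```python
import math
import random


class SharedActor:
    """One linear-tanh policy whose weights are reused by every follower (assumed form)."""

    def __init__(self, obs_dim, seed=0):
        rng = random.Random(seed)
        self.weights = [rng.uniform(-0.1, 0.1) for _ in range(obs_dim)]
        self.bias = 0.0

    def act(self, obs):
        # Map one agent's local observation to a normalized acceleration in (-1, 1).
        z = sum(w * o for w, o in zip(self.weights, obs)) + self.bias
        return math.tanh(z)


def centralized_critic_input(observations, actions):
    """Concatenate all agents' observations and actions into one critic input vector."""
    flat_obs = [x for obs in observations for x in obs]
    return flat_obs + list(actions)


# Hypothetical local observations: [gap error (m), relative speed (m/s), own speed (m/s)]
observations = [[1.2, -0.3, 20.0], [-0.5, 0.1, 19.8], [0.0, 0.0, 20.1]]
actor = SharedActor(obs_dim=3)
actions = [actor.act(obs) for obs in observations]  # same weights for every agent
critic_in = centralized_critic_input(observations, actions)
```

Because the actor holds one set of weights, adding a follower only lengthens the list of observations; the critic input grows with the number of agents, which is the part that stays centralized during training.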
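
The easy-to-hard sampling in item (3) could look roughly like the sketch below. The abstract does not give the exact formulas, so the linear schedule for the proportional factor, the speed-dependent mean, and every parameter value (`target_speed`, `gain`, `max_sigma`, `a_limit`) are assumptions chosen for illustration.

```python
import random


def proportional_factor(step, total_steps, start=0.2, end=1.0):
    """Grow the difficulty factor linearly from `start` to `end` (assumed schedule)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def sample_lead_acceleration(speed, step, total_steps, rng,
                             target_speed=20.0, gain=0.1,
                             max_sigma=1.5, a_limit=3.0):
    """Draw the lead vehicle's acceleration (m/s^2) from a normal distribution."""
    # Assumed speed dependence: the mean pulls the leader toward a cruising speed.
    mean = gain * (target_speed - speed)
    # The spread widens as training progresses, so later scenarios are harder.
    sigma = proportional_factor(step, total_steps) * max_sigma
    accel = rng.gauss(mean, sigma)
    # Clip to an assumed physical acceleration limit.
    return max(-a_limit, min(a_limit, accel))


rng = random.Random(0)
early = [sample_lead_acceleration(20.0, 0, 10_000, rng) for _ in range(2000)]
late = [sample_lead_acceleration(20.0, 10_000, 10_000, rng) for _ in range(2000)]
```

Early in training the samples cluster near zero acceleration, while late samples spread over most of the allowed range. A uniform-distribution baseline would instead draw from the full range at every step.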
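
The reward-based learning-rate adjustment in item (4) is similar in spirit to a reduce-on-plateau scheduler. The following is a minimal sketch under assumptions: the class name `RewardPlateauScheduler`, the window size, the patience, and the decay factor are hypothetical and are not taken from the paper.

```python
from collections import deque


class RewardPlateauScheduler:
    """Lower the learning rate when the moving-average reward stops improving."""

    def __init__(self, lr, window=10, patience=5, factor=0.5, min_lr=1e-6):
        self.lr = lr
        self.rewards = deque(maxlen=window)  # most recent n rewards of one agent
        self.patience = patience             # tolerated steps without improvement
        self.factor = factor                 # multiplicative decay (assumed)
        self.min_lr = min_lr
        self.best = float("-inf")
        self.bad_steps = 0

    def step(self, reward):
        """Record one reward and return the (possibly reduced) learning rate."""
        self.rewards.append(reward)
        avg = sum(self.rewards) / len(self.rewards)
        if avg > self.best:
            self.best = avg
            self.bad_steps = 0
        else:
            self.bad_steps += 1
            if self.bad_steps >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_steps = 0
        return self.lr


scheduler = RewardPlateauScheduler(lr=1e-3, window=3, patience=2)
for r in [1.0, 2.0, 2.0, 2.0, 2.0, 2.0]:
    current_lr = scheduler.step(r)
```

In a multi-agent setup, one such scheduler per agent would monitor that agent's reward history, and the returned value would be written into the optimizer of the policy network before the next update.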