In recent years, facing intense competition for resources, making full use of limited or scarce resources to create the greatest value for society has become extremely urgent. Scheduling optimization has therefore attracted many researchers in the 21st century: how to quantify the social value of resources and maximize the value of their utilization is an urgent problem, and container scheduling optimization is a typical instance of it. Reinforcement learning is well suited to such problems: agents interact with the environment and learn how to 'play' with it through trial and error. The main contributions of this paper are as follows:

(1) Based on the background of container scheduling optimization, this paper models the problem. The main structure is divided into a destination selection block, a feature extraction block, a reward design block, and an agent control block, and reinforcement learning is used for training.

(2) Using prior knowledge of the spatial topology between ports, this paper proposes an algorithm named A-DDQN (Attention-based Deep Double Q-Learning). The attention mechanism captures the features of neighboring ports, which makes the results more reasonable and accurate.

(3) To address the poor interpretability of deep neural networks, this paper proposes an algorithm named L-DQN (LightGBM-based Double Q-Learning). With the strong fitting ability of ensemble learning, the results are more stable; at the same time, the importance scores of different features help researchers carry out analysis.

(4) To address unstable action exploration in reinforcement learning, this paper proposes an algorithm named PN-DDPG (Parameter Noise based Deep Deterministic Policy Gradient). With a self-adapting noise parameter, action exploration is more stable.

The three algorithms proposed in this paper are applied to the container scheduling optimization problem. The results verify the algorithms' effectiveness and superiority on two evaluation indicators. The agent trains through trial and error, and the resulting policy outperforms human decision-making.
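For reference, the Double Q-learning update underlying both the A-DDQN and L-DQN variants above can be sketched in plain Python. This is a minimal illustration of the standard Double Q target, not the thesis's implementation; all function and variable names here are illustrative.

```python
def double_q_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double Q-learning target: the online network selects the next action,
    the target network evaluates it, reducing overestimation bias."""
    if done:
        return reward
    # argmax over next-state action values from the ONLINE network
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # evaluate that action with the TARGET network
    return reward + gamma * q_target_next[best_action]

# Example: the online net prefers action 1; the target net scores it 2.0
target = double_q_target(reward=1.0, done=False,
                         q_online_next=[0.5, 2.5, 1.0],
                         q_target_next=[3.0, 2.0, 0.1])
# → 2.98  (1.0 + 0.99 * 2.0)
```

In L-DQN the two value estimators would be gradient-boosted models rather than neural networks, but the target computation is the same.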
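The self-adapting noise of PN-DDPG can likewise be sketched. Parameter noise perturbs the policy's weights rather than its output actions, and the noise scale adapts to keep the perturbed policy a bounded distance from the unperturbed one. This is a hedged sketch of that general scheme with illustrative names and constants, not the thesis's code.

```python
import random

def perturb_weights(weights, sigma, rng):
    """Add Gaussian noise to every policy parameter (not to the actions)."""
    return [w + rng.gauss(0.0, sigma) for w in weights]

def adapt_sigma(sigma, action_distance, threshold, alpha=1.01):
    """Self-adapt the noise scale: shrink it when the perturbed policy's
    actions drift too far from the unperturbed policy's, grow it otherwise."""
    if action_distance > threshold:
        return sigma / alpha  # perturbation too strong: reduce noise
    return sigma * alpha      # perturbation too weak: increase noise

rng = random.Random(0)
sigma = 0.1
noisy_weights = perturb_weights([0.5, -0.2, 1.3], sigma, rng)
# measured distance between perturbed and unperturbed actions exceeds
# the threshold, so the noise scale shrinks for the next episode
sigma = adapt_sigma(sigma, action_distance=0.4, threshold=0.2)
```

Because the noise lives in parameter space and its scale is adjusted automatically, exploration stays consistent within an episode, which is the stability property the abstract refers to.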