In recent years, thanks to the development of deep learning and the success of AlphaGo against human Go players, deep reinforcement learning has received widespread attention from both academia and industry. More and more advanced reinforcement learning algorithms have been proposed, and deep reinforcement learning has been widely applied to games, robotics and autonomous driving. At present, however, deep reinforcement learning has seen little application in industrial production, especially in decision-making for operational indices. Research on this problem is of great significance for optimizing the whole production process under a complex, changing environment.

Decision-making for operational indices coordinates the operational indices of each production unit according to the working conditions of the production environment, so as to optimize the quality and yield of products over the whole production process. The production process involves a large number of physical and chemical reactions, which makes it difficult to establish a usable mechanistic model; moreover, it exhibits complex nonlinearity and strong coupling, and the production environment changes irregularly. Academia and industry have proposed many methods for this problem, but they typically require solving a data-driven model, and the accuracy of the solution depends heavily on the modeling precision. In practice, therefore, operational indices are mainly decided by process engineers according to their experience and knowledge.

In view of the above problems, this paper, supported by the National Natural Science Foundation of China (NSFC) project "Closed-loop optimization decision-making method for multi-process operational indices in complex industrial processes under dynamic environments" (61273031), proposes an optimization decision-making method for operational indices in the process industry based on deep reinforcement learning (DRL), which dynamically coordinates the operational indices of each production unit as working conditions change. The main work is as follows:

(1) This paper presents and analyzes the decision-making problem for operational indices in the process industry and gives its mathematical form. A framework for decision-making for operational indices based on reinforcement learning is established: starting from a detailed description of the whole production process, the decision-making problem is cast as a reinforcement learning problem, and the state, action and reward are defined within that framework (a schematic formulation is sketched after this summary of contributions).

(2) To handle the high dimensionality and continuity of the problem, an optimization algorithm based on the Actor-Critic architecture is designed. The algorithm maintains an experience replay buffer that stores a large amount of past experience data; at each update it randomly samples a batch of experience from the buffer, which avoids frequent online sampling interactions with the industrial process (a minimal sketch of the replay-based update appears after the formulation sketch below). The proposed algorithm is applied to a mineral beneficiation process and compared with manual decision-making and a classical reinforcement learning algorithm (REINFORCE): it achieves higher yield while keeping the product grade qualified, and it also exhibits a shorter learning process, faster convergence and fewer trial-and-error interactions.
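To make the formulation in (1) concrete, the following minimal Python sketch shows one way the state, action and reward could be encoded. The abstract does not give the actual definitions, so the variable names, dimensions, grade threshold and penalty weighting are all illustrative assumptions, and the process response here is a synthetic placeholder for the real plant.

```python
import numpy as np

class OperationalIndexEnv:
    """Schematic MDP for decision-making on operational indices.

    State  : observed working-condition variables of the production
             environment (names and dimension are assumed).
    Action : continuous setpoints for the operational indices of each
             production unit, bounded to admissible ranges.
    Reward : product yield, penalized when the product grade falls
             below the qualified threshold (weights are assumed).
    """

    def __init__(self, n_conditions=6, n_indices=4, grade_min=0.55, penalty=10.0):
        self.n_conditions = n_conditions
        self.n_indices = n_indices
        self.grade_min = grade_min   # qualified-grade threshold (assumed)
        self.penalty = penalty       # grade-violation penalty weight (assumed)
        self.state = None

    def reset(self):
        # Working conditions change irregularly; random draws stand in
        # for real plant measurements here.
        self.state = np.random.uniform(0.0, 1.0, self.n_conditions)
        return self.state

    def step(self, action):
        # Synthetic stand-in for the real plant response, which the
        # paper treats as unmodeled and accesses only through data.
        yield_ = float(np.clip(action.mean() + 0.1 * self.state.mean(), 0.0, 1.0))
        grade = float(np.clip(1.0 - 0.5 * action.std(), 0.0, 1.0))
        reward = yield_ - self.penalty * max(0.0, self.grade_min - grade)
        self.state = np.random.uniform(0.0, 1.0, self.n_conditions)
        return self.state, reward, False
```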
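The abstract does not specify the network structures or update equations of the Actor-Critic algorithm in (2), so the sketch below shows one standard way such a learner with an experience replay buffer can be wired up in PyTorch, essentially a DDPG-style update; target networks, exploration noise and all hyperparameters are omitted or assumed, and the paper's actual algorithm may differ.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

class ReplayBuffer:
    """Stores past (s, a, r, s') transitions so that each update can reuse
    historical data instead of sampling the plant online at every step."""
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)

    def push(self, s, a, r, s2):
        self.buf.append((s, a, r, s2))

    def sample(self, batch_size):
        batch = random.sample(self.buf, batch_size)
        to_tensor = lambda xs: torch.as_tensor(np.array(xs), dtype=torch.float32)
        s, a, r, s2 = (to_tensor(xs) for xs in zip(*batch))
        return s, a, r, s2

class Actor(nn.Module):
    """Maps a working-condition state to continuous operational-index setpoints."""
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(),
                                 nn.Linear(64, a_dim),
                                 nn.Tanh())  # bounded output, rescaled to index ranges in practice

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Scores a state-action pair with an estimated long-run return Q(s, a)."""
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def update(actor, critic, buffer, actor_opt, critic_opt, batch_size=64, gamma=0.99):
    # One learning step: sample a random minibatch from the replay buffer
    # rather than querying the industrial process online.
    s, a, r, s2 = buffer.sample(batch_size)
    # Critic: regress Q(s, a) toward the one-step bootstrapped target.
    with torch.no_grad():
        target = r.unsqueeze(-1) + gamma * critic(s2, actor(s2))
    critic_loss = nn.functional.mse_loss(critic(s, a), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: adjust the policy to increase the Critic's score of its actions.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```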
(3) To address the tendency of the training process to fall into local optima, an optimization algorithm based on an ensemble of multiple Actor networks is designed. The algorithm randomly initializes several Actor networks, and each policy network draws its own batch of data from the experience replay buffer for training, so the Actor networks settle into different locally optimal policies. At decision time, the Critic network evaluates the action proposed by each Actor, and the best-scoring action is chosen (see the final sketch below). Experimental results show that this method overcomes the tendency of the policy network to become trapped in local optima and improves production performance.
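Reusing the Actor, Critic and ReplayBuffer classes from the previous sketch, the following shows one plausible reading of the ensemble mechanism in (3): each Actor trains on its own independently sampled minibatch, and at decision time the Critic scores every Actor's proposal and the highest-valued action is executed. The ensemble size and this exact selection rule are assumptions inferred from the abstract's description.

```python
import torch

def ensemble_update(actors, actor_opts, critic, buffer, batch_size=64):
    # Each Actor trains on a different randomly sampled minibatch, so the
    # networks drift toward different locally optimal policies. The Critic
    # itself is trained as in the previous sketch.
    for actor, opt in zip(actors, actor_opts):
        s, _, _, _ = buffer.sample(batch_size)
        loss = -critic(s, actor(s)).mean()
        opt.zero_grad(); loss.backward(); opt.step()

def ensemble_action(actors, critic, state):
    # At decision time, every Actor proposes an action for the current
    # working conditions; the Critic scores each proposal and the
    # highest-valued one is chosen.
    s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        proposals = [actor(s) for actor in actors]
        scores = [critic(s, a).item() for a in proposals]
    best = max(range(len(proposals)), key=scores.__getitem__)
    return proposals[best].squeeze(0).numpy()
```

For instance, actors = [Actor(s_dim, a_dim) for _ in range(5)] with one optimizer per Actor would instantiate such an ensemble; five is an arbitrary illustrative choice, as the abstract does not state the ensemble size.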