The flow line is a widely adopted production mode. With more than three machines, the flow-shop scheduling problem is NP-hard, and research on it is of great theoretical and engineering value. Traditional approaches to the scheduling problem, such as mathematical modeling and heuristic or meta-heuristic algorithms, can obtain near-optimal solutions in a short time, but they struggle to cope with dynamic changes in the conditions of resources and tasks. Deep reinforcement learning can take actions in direct response to dynamic states, which makes it better suited to state-responsive manufacturing processes. Therefore, a deep reinforcement learning algorithm is applied to the Non-permutation Flow-Shop Scheduling (NPFS) problem for the first time.

Firstly, the underlying theories are introduced, including Deep Learning (DL) based on neural networks and Reinforcement Learning (RL) based on the Markov Decision Process (MDP), and the framework of the Deep Temporal Difference Network (DTDN) reinforcement learning algorithm is established.

Secondly, the NPFS problem is described. Fifteen manufacturing state features are defined numerically, and a candidate action set consisting of 28 constructive heuristics and dispatching rules is constructed. The reward function is defined according to the objective of minimizing the makespan, so NPFS problems are transformed into MDPs. The proposed approach is applied to F||Cmax benchmark problems and compared with the Simple Constructive Heuristic (SCH) and Ant Colony System (ACS) methods. The algorithm obtains solutions below the upper bounds of the original problems in fewer iterations, and its solution quality is clearly better than that of the compared methods, which validates the effectiveness of the DTDN algorithm.

Thirdly, the Multi-objective Optimization Problem (MOP) model is given and Multi-Objective RL (MORL) is described, including its basic architecture and solution methods. A synthetic objective of minimizing makespan and energy consumption is then established to test the multi-objective DTDN algorithm on the Taillard benchmark problems, using a multiple-policy method with varying parameters. The results show that the approach obtains good Pareto solutions. An improvement is suggested based on a comparative analysis of the experimental outcomes under different learning-rate parameters.

Finally, the dynamic scheduling problem, its commonly used performance indices, and rescheduling policies are described. An NPFS problem with dynamic order arrivals is devised from the Car instances. The experimental results further confirm the dynamic adaptability of the DTDN algorithm under an event-driven rescheduling policy. The concluding part summarizes the primary research results and puts forward prospects for further research.
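
The abstract states that the reward function follows the makespan-minimization objective but does not spell it out. The following is only a plausible sketch of such a formulation; the incremental-reward form is an assumption, not the thesis's stated definition:

\[
C_{\max} = \max_{1 \le j \le n} C_j,
\qquad
r_t = -\bigl(C_{\max}(s_{t+1}) - C_{\max}(s_t)\bigr),
\]

where C_j is the completion time of job j and s_t is the manufacturing state after the t-th dispatching decision. Under this assumed form the cumulative reward telescopes to the negative final makespan, so maximizing the return is equivalent to minimizing C_max.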
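
As a concrete illustration of the decision mechanism summarized above, the minimal sketch below implements a small temporal-difference Q-network over the 15 state features and the 28 candidate actions. Every name, dimension, and hyperparameter is a hypothetical stand-in: the thesis's actual DTDN architecture is not specified in this abstract, and the toy reward follows the assumed negative-makespan-increment form from the previous sketch.

    # Minimal TD(0) Q-network sketch for the NPFS setting described above.
    # Sizes, hyperparameters, and architecture are illustrative assumptions.
    import numpy as np

    N_FEATURES, N_ACTIONS, N_HIDDEN = 15, 28, 64   # 15 state features, 28 rules
    GAMMA, LR, EPSILON = 0.95, 1e-3, 0.1

    rng = np.random.default_rng(0)
    W1 = rng.normal(0, 0.1, (N_FEATURES, N_HIDDEN))
    b1 = np.zeros(N_HIDDEN)
    W2 = rng.normal(0, 0.1, (N_HIDDEN, N_ACTIONS))
    b2 = np.zeros(N_ACTIONS)

    def q_values(state):
        """Forward pass: 15 state features -> one Q-value per candidate rule."""
        h = np.maximum(0.0, state @ W1 + b1)       # ReLU hidden layer
        return h, h @ W2 + b2

    def select_action(state):
        """Epsilon-greedy choice among the 28 heuristics/dispatching rules."""
        if rng.random() < EPSILON:
            return int(rng.integers(N_ACTIONS))
        _, q = q_values(state)
        return int(np.argmax(q))

    def td_update(state, action, reward, next_state, done):
        """One TD(0) step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
        global W1, b1, W2, b2
        h, q = q_values(state)
        _, q_next = q_values(next_state)
        target = reward if done else reward + GAMMA * float(np.max(q_next))
        err = q[action] - target                   # TD error on the chosen action
        grad_q = np.zeros(N_ACTIONS)
        grad_q[action] = err                       # d(0.5 * err^2) / dq = err
        grad_h = (W2 @ grad_q) * (h > 0.0)         # backprop through the ReLU
        W2 -= LR * np.outer(h, grad_q)
        b2 -= LR * grad_q
        W1 -= LR * np.outer(state, grad_h)
        b1 -= LR * grad_h

    # Toy usage on random vectors standing in for real scheduling states.
    s = rng.random(N_FEATURES)
    a = select_action(s)
    s_next = rng.random(N_FEATURES)
    r = -rng.random()                              # assumption: -(makespan increment)
    td_update(s, a, r, s_next, done=False)
    print("chosen rule index:", a)

In a full experiment, each decision step would apply the selected constructive heuristic or dispatching rule to the current partial schedule, recompute the 15 state features, and feed the resulting transition back into the TD update.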