| With the development of artificial intelligence technology, machine learning methods such as reinforcement learning and deep learning have become some of the most active research areas. Reinforcement learning learns optimal policies through interaction between an agent and its environment. In recent years, many inverse reinforcement learning algorithms that learn reward functions and optimal policies have been proposed to address the difficulty of designing reward functions by hand in reinforcement learning. In real-world, complex environments, however, limited or non-optimal expert demonstrations and improper training can cause overfitting, exploding gradients, and vanishing gradients when reward functions and optimal policies are learned with inverse reinforcement learning. Moreover, there are still few studies on inverse reinforcement learning. Therefore, this paper focuses on probability-based inverse reinforcement learning methods, establishes the maximum entropy model, and proposes a series of algorithms for learning reward functions and performing policy optimization. The main contents of this paper comprise the following four parts. 1. Because the maximum entropy algorithm for learning reward functions and optimal policies in inverse reinforcement learning (IRL) suffers from high computational complexity, overfitting, and poor convergence, a maximum entropy IRL algorithm for online proximal optimization based on truncated gradients (ME-TFTPRL IRL) is proposed. The follow-the-proximally-regularized-leader (FTPRL) method, which produces better sparse solutions, is used as the proximal optimizer to improve the generalization performance of the ME-TFTPRL algorithm. Regularization and an adaptive state learning rate are used to select features and correct the update direction of the reward weights, which reduces model complexity, avoids overfitting, and speeds up convergence. In each iteration, the truncated gradient (TG) method is used to update the reward weights, avoiding the 
floating-point problem of the FTPRL method. The sparsity and convergence of ME-TFTPRL IRL are then proved based on regularization, the TG method, and a regret bound. The experimental results show that the proposed algorithm achieves good sparsity, generalization, and convergence speed. 2. An ensemble maximum entropy deep inverse reinforcement learning algorithm (AME-DIRL) is proposed to handle imbalanced expert demonstration data and overfitting. The method overcomes the imbalance of the data set by combining multiple maximum entropy deep inverse reinforcement learning processes into a strong learner. In addition, to reduce the computational complexity of AME-DIRL, the rewards obtained by the strong learner are sparsified using the TG method, which reduces the complexity of the model. To prevent overfitting, a correction factor is added to the linear combination of the weak learners. Experimental results show that the AME-DIRL algorithm learns rewards with high accuracy. 3. To address non-optimal expert demonstrations, low learning efficiency, and overfitting in reward function fusion, maximum entropy inverse reinforcement learning based on soft Q-learning and the AdaBoost method (SQL-AME-IRL) is proposed to recover the reward function from generated expert demonstrations. First, an improved soft Q-learning algorithm is used to learn the best state-action pairs, which are treated as expert demonstrations; this improves exploration and robustness. To overcome low learning efficiency, the learning task is divided into multiple subtasks, each of which is solved by a strong learner that integrates multiple maximum entropy inverse reinforcement learning processes. The rewards recovered by the strong learners are merged by linear combination, which addresses overfitting and fuses the recovered reward functions. The experimental results show that SQL-AME-IRL performs well in 
learning reward functions and policies. 4. To improve learning efficiency and convergence and to address limited expert demonstrations in high-dimensional, complex environments, the adaptive generative adversarial inverse reinforcement learning (AGA-MEIRL) algorithm is proposed for learning optimal rewards and policies from small-sample expert demonstrations. In AGA-MEIRL, discriminator overfitting and mode collapse are addressed by iteratively integrating multiple generative adversarial IRL models. To overcome the vanishing gradient problem, the SELU activation function is introduced. In addition, to address the exploding gradient problem, gradient clipping is added to the model to stabilize training. The convergence of AGA-MEIRL is analyzed based on the upper bound of AdaGAN. Experiments show that the proposed AGA-MEIRL learns good reward functions and policies from small-sample expert demonstrations. |
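
The abstract names three generic building blocks without showing them: truncated-gradient weight updates, gradient clipping, and the SELU activation. The sketch below illustrates the textbook form of each in plain Python. It is not the paper's implementation: the function names, the parameters `gravity` and `threshold`, and the example numbers are all assumptions made for illustration, and the paper's algorithms may combine these pieces differently (for example with FTPRL or adaptive learning rates, which are not shown here).

```python
import math

# Standard SELU constants (self-normalizing activation).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def truncate(weight, shrink, threshold):
    """Pull a weight toward zero by `shrink` if its magnitude is at most
    `threshold`; weights that would cross zero are set to exactly zero."""
    if 0.0 <= weight <= threshold:
        return max(0.0, weight - shrink)
    if -threshold <= weight < 0.0:
        return min(0.0, weight + shrink)
    return weight  # large weights are left untouched


def truncated_gradient_step(weights, grads, lr, gravity, threshold):
    """One plain gradient step followed by truncation (hypothetical helper).

    `gravity` controls how strongly small weights are pulled to zero,
    which is what produces sparse reward weights.
    """
    stepped = [w - lr * g for w, g in zip(weights, grads)]
    return [truncate(w, lr * gravity, threshold) for w in stepped]


def clip_by_norm(grads, max_norm):
    """Rescale the gradient so its L2 norm does not exceed `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def selu(x):
    """SELU: scaled identity for x > 0, scaled exponential otherwise."""
    if x > 0.0:
        return SELU_SCALE * x
    return SELU_SCALE * SELU_ALPHA * (math.exp(x) - 1.0)


if __name__ == "__main__":
    # Illustrative numbers only; not taken from the paper.
    weights = [0.05, -0.02, 0.8]
    grads = clip_by_norm([0.1, -0.1, 0.5], max_norm=1.0)
    new_weights = truncated_gradient_step(
        weights, grads, lr=0.1, gravity=0.5, threshold=0.1
    )
    print(new_weights)  # the two small weights are truncated to zero
    print(selu(-1.0), selu(1.0))
```

In this toy run the two small weights fall inside the truncation band and become exactly zero, while the large weight only takes the ordinary gradient step. That is the sparsity effect the abstract attributes to the TG method.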