Model-free deep reinforcement learning (DRL) algorithms have been successfully applied to a range of challenging sequential decision-making and control tasks. Among them, the fixed-temperature Soft Actor-Critic (SAC) algorithm [1] achieves markedly stronger experimental results than competing algorithms. However, we identify theoretical problems arising from the entropy term introduced in the maximum entropy objective underlying SAC. Although this entropy term improves SAC's exploration-encouraging effect, it can also cause optimization deviation and Q value overestimation. We analyze how the maximum entropy objective gives rise to these two problems and formulate a modified framework that resolves them, yielding our Constrained Soft Actor-Critic (CSAC) algorithm, which removes the problems hidden in SAC while preserving the same exploration-encouraging effect. CSAC, however, exhibits a further problem we call the exploitation bottleneck, which manifests as instability in the trailing process. We therefore develop the Stable Constrained Soft Actor-Critic (SCSAC) algorithm to resolve the exploitation bottleneck underlying CSAC and improve stability in the trailing process. Finally, since the policy improvement theory of SCSAC has a potential problem in the process of finding the optimal policy, we develop the Further Revised Stable Constrained Soft Actor-Critic (FRSCSAC) algorithm to correct it.

In summary, all of our algorithms resolve the optimization deviation and Q value overestimation problems while retaining the same exploration-encouraging effect as SAC. Each algorithm is supported by extensive theoretical derivation and proofs, so we consider them theoretically complete. Last but not least, the training, trailing, and Q function overestimation experiments show that our algorithms significantly reduce the occurrence of Q function overestimation while achieving results comparable to SAC. We therefore believe our algorithms can be readily applied to real-world problems with appropriate modification.
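For context, the maximum entropy objective and the soft Bellman backup referred to above take the following standard forms in the fixed-temperature SAC formulation [1] (a sketch in the usual SAC notation, with temperature $\alpha$ fixed); the $\alpha$-weighted entropy term is the point where the optimum can deviate from the reward-only optimum and where extra value can enter the Q target:

```latex
% Maximum entropy objective: expected return plus an entropy bonus
% weighted by the fixed temperature \alpha.
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
         \Big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]

% Soft Bellman backup: the entropy term appears inside the Q target
% as the -\alpha \log \pi term.
Q(s_t, a_t) \leftarrow r(s_t, a_t)
  + \gamma \, \mathbb{E}_{s_{t+1}} \Big[
      \mathbb{E}_{a_{t+1} \sim \pi} \big[ Q(s_{t+1}, a_{t+1})
        - \alpha \log \pi(a_{t+1} \mid s_{t+1}) \big] \Big]
```

Because the entropy bonus is added directly to the reward signal in both expressions, the learned Q values and the induced policy optimize a shifted objective rather than the pure return, which is the source of the optimization deviation and the inflated Q targets discussed above.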