
Option Learning Method Research With Double Actor-Critic Architecture

Posted on: 2024-05-04    Degree: Master    Type: Thesis
Country: China    Candidate: Y P Xu    Full Text: PDF
GTID: 2568306941964059    Subject: Computer technology
Abstract/Summary:
Hierarchical reinforcement learning is an important branch of reinforcement learning, mainly used to solve complex problems such as long-horizon sequential decision making and sparse rewards, and it is currently one of the research hotspots in the field. As a main research direction of hierarchical reinforcement learning, option learning methods build powerful hierarchical decision-making ability through temporal abstraction and have achieved good results in many real-world domains, such as autonomous driving, recommender systems, natural language processing, and games. In option learning algorithms, giving each option a certain problem-solving ability helps the agent obtain higher environmental rewards by using options rationally in complex tasks, and introducing the double actor-critic architecture is an effective way to activate all options equally. This thesis focuses on the low sample efficiency, unstable policies, and slow training of option learning algorithms based on the double actor-critic architecture, and makes the following three contributions:

i. The original double actor-critic architecture supports only on-policy learning, which results in low sample efficiency. To address this issue, a double actor-critic option learning method based on off-policy updates is proposed. It improves sample efficiency by introducing off-policy training with random, non-sequential sampling, and adds information constraints to further improve structural stability. Multiple sets of experiments show that, compared with other methods, it significantly improves sample efficiency while keeping the hierarchical structure stable.

ii. Existing option learning methods consider only state information when the high-level policy guides low-level actions, which under-uses option information and leaves the low-level policies unstable. To address this issue, a double actor-critic option learning method based on trajectory information is proposed. It uses different types of information in option trajectories for guidance and introduces an option similarity measure as an intrinsic reward to improve the stability of the low-level policies. Multiple sets of experiments show that this algorithm has better stability.

iii. Traditional experience replay cannot effectively exploit the excellent experiences in the replay buffer, which reduces training efficiency; moreover, traditional methods consider only one aspect of what makes an experience excellent and do not fully reflect the option-combination ability of the hierarchy. To address this issue, an option learning method based on superior experience is proposed. It classifies transitions by option advantage information and controls the sampling ratio of superior experiences, improving the utilization of excellent historical samples and accelerating agent training. Multiple sets of experiments show that this method speeds up agent training while achieving high performance.
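As a rough illustration of the off-policy mechanism described in contribution i, the sketch below shows an option-level replay buffer with random, non-sequential minibatch sampling. The class name, capacity, and transition layout are illustrative assumptions rather than the thesis's actual implementation, and the information constraint is omitted.

```python
import random
from collections import deque

class OptionReplayBuffer:
    """Fixed-size buffer of option-level transitions for off-policy updates
    (illustrative sketch; names and layout are assumptions)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, option, action, reward, next_state, done):
        # Store one transition; the oldest is discarded once capacity is reached.
        self.buffer.append((state, option, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Random, non-sequential sampling breaks the temporal correlation between
        # consecutive transitions, which is what enables off-policy minibatch updates.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```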
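The intrinsic reward of contribution ii can be pictured as a similarity bonus added to the environment reward. A minimal sketch follows, assuming a cosine similarity between a hypothetical option embedding and a feature of the recent sub-trajectory; the specific similarity measure, the feature construction, and the weight beta are assumptions for illustration, not necessarily the measure used in the thesis.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    # Similarity in [-1, 1] between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def shaped_reward(env_reward, option_embedding, trajectory_feature, beta=0.1):
    # Intrinsic bonus: how closely the recent sub-trajectory matches the behaviour
    # the active option is expected to produce; beta is an illustrative weight.
    return env_reward + beta * cosine_similarity(option_embedding, trajectory_feature)
```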
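Contribution iii controls how often superior (high option-advantage) experiences are replayed. Below is a minimal sketch assuming transitions have already been split into two list-like buffers by an advantage threshold; the threshold, the sampling ratio, and the two-buffer split are illustrative assumptions, not the thesis's actual scheme.

```python
import random

def is_superior(option_advantage, threshold=0.0):
    # A transition counts as "superior" if its option advantage exceeds a threshold
    # (the threshold value here is an assumption for illustration).
    return option_advantage > threshold

def sample_mixed(superior_buffer, ordinary_buffer, batch_size=64, superior_ratio=0.5):
    # Draw a minibatch in which a controlled fraction comes from the superior buffer,
    # so high-advantage experiences are replayed more often but not exclusively.
    n_sup = min(int(batch_size * superior_ratio), len(superior_buffer))
    n_ord = min(batch_size - n_sup, len(ordinary_buffer))
    return random.sample(superior_buffer, n_sup) + random.sample(ordinary_buffer, n_ord)
```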
Keywords/Search Tags:hierarchical reinforcement learning, double actor-critic architecture, off-policy, trajectory information, superior experience