| Most Reinforcement Learning (RL) algorithms depend on many hyperparameters, and their performance is very sensitive to them. Hyperparameter tuning is traditionally performed in an outer loop with methods such as grid search, random search, or Bayesian optimization. Such tuning requires extensive computing resources and is challenging in practice, especially with curriculum learning or in multi-task settings, since good hyperparameter values may be task-dependent. In contrast, some works in RL tune hyperparameters online in a single run, such as learning rates, the bootstrapping parameter λ in λ-returns, or any differentiable hyperparameter. Auto-tuning hyperparameters in a single run is appealing because it is much more efficient than traditional methods; however, it is also much more challenging to achieve. Poor optimization of any hyperparameter can cumulatively harm an agent’s learning progress. For instance, very small or very large hyperparameter values may slow down convergence or even lead to divergence. Even if training converges, the trained policy may be suboptimal, and slower convergence wastes computational, memory, and sample resources, the last of which is particularly costly in robotics. Our work follows this promising line of research and proposes a novel auto-tuning method, which we demonstrate on the Reinforcement Learning with Imagined Goals (RIG) algorithm. We expect that our auto-tuning technique could also be adapted to other deep RL algorithms that are based on Variational Auto-Encoders (VAEs) and use visual inputs, but we leave this for future work. In unsupervised reinforcement learning, RIG learns general-purpose skills in task-agnostic environments using only RGB images as observations, without a reward function designed for each task. RIG leverages self-generated goals to acquire a wide range of skills from the environment. In RIG, an agent learns to reach goals produced by a VAE-based goal generator that is fine-tuned online with collected samples stored in a replay
buffer. In this thesis, we propose to use the value of the negative ELBO, which is the loss function optimized in VAEs, to estimate the number of different goals. Based on this premise, our method auto-tunes three hyperparameters to optimize learning in VAE-based environments, namely: 1. The number Ne of exploration steps in each epoch, so that goals are sampled and the agent interacts with the environment for a sufficient number of steps; 2. The replay buffer size Nb, so as neither to lose (or forget) transitions nor to waste memory resources; 3. The number Nθ of gradient updates in each epoch, so that the policy is updated appropriately at each epoch. Our proposal is motivated by the following observations: 1. the value of the negative ELBO is positively correlated with the diversity of the training samples, and thus also with the number Ng of different goals in the replay buffer for RIG; and 2. the hyperparameters (Ne, Nθ, Nb) can be approximated or upper-bounded by a multiple of Ng. Therefore, the negative ELBO can be used to auto-tune these three hyperparameters. The contributions of our work are threefold: 1. We identify that the loss function of VAEs is related to the diversity of the samples; 2. To avoid suboptimal learning or wasted computing or sample resources, we propose a methodology that auto-tunes the three hyperparameters (Ne, Nθ, Nb); 3. We experimentally validate our approach on diverse domains, and we additionally report competitive performance in curriculum learning settings. |
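
The idea of deriving (Ne, Nθ, Nb) from the negative ELBO can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the thesis: the linear map from negative ELBO to the goal-count estimate Ng (the text only states a positive correlation), the per-goal multipliers, and all function and parameter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TunedHyperparameters:
    exploration_steps: int  # Ne: exploration steps per epoch
    gradient_updates: int   # Nθ: gradient updates per epoch
    buffer_size: int        # Nb: replay buffer capacity


def estimate_num_goals(neg_elbo: float, scale: float = 1.0, offset: float = 0.0) -> int:
    """Map the negative ELBO to an estimate of Ng.

    ASSUMPTION: a linear, increasing map. The text only states that the
    negative ELBO is positively correlated with Ng; `scale` and `offset`
    are hypothetical calibration constants.
    """
    return max(1, round(scale * neg_elbo + offset))


def auto_tune(
    neg_elbo: float,
    steps_per_goal: int = 50,        # hypothetical multiplier for Ne
    updates_per_goal: int = 50,      # hypothetical multiplier for Nθ
    transitions_per_goal: int = 50,  # hypothetical multiplier for Nb
    scale: float = 1.0,
    offset: float = 0.0,
) -> TunedHyperparameters:
    """Set each hyperparameter to a multiple of the estimated Ng."""
    ng = estimate_num_goals(neg_elbo, scale, offset)
    return TunedHyperparameters(
        exploration_steps=steps_per_goal * ng,
        gradient_updates=updates_per_goal * ng,
        buffer_size=transitions_per_goal * ng,
    )
```

In such a scheme, `auto_tune` would be called once per epoch with the VAE's current loss, so that the exploration budget, the number of gradient updates, and the buffer capacity grow as the collected samples become more diverse.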