Policy optimization in reinforcement learning requires the selection of numerous hyperparameters across different environments. Setting them incorrectly can degrade optimization performance, notably leading to insufficient or redundant learning. Insufficient learning (due to convergence to local optima) results in under-performing policies, whilst redundant learning wastes time and resources. These effects are further exacerbated when a single policy is used to solve multi-task learning problems. Observing that the Evidence Lower Bound (ELBO) used in Variational Auto-Encoders correlates with the diversity of image samples, we propose an auto-tuning technique based on the ELBO for self-supervised reinforcement learning. Our approach auto-tunes three hyperparameters: the replay buffer size, the number of policy gradient updates during each epoch, and the number of exploration steps during each epoch. We use a state-of-the-art self-supervised robot learning framework (Reinforcement Learning with Imagined Goals (RIG) with Soft Actor-Critic) as the baseline for experimental verification. Experiments show that our method auto-tunes these hyperparameters online and yields the best performance at a fraction of the time and computational resources. Code, video, and the appendix for simulated and real-robot experiments can be found at the project page \url{www.JuanRojas.net/autotune}.
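For reference, the ELBO signal mentioned above is the standard VAE objective; the sketch below uses conventional VAE notation ($x$ for an observation, $z$ for the latent code, $q_\phi$ for the encoder, $p_\theta$ for the decoder, $p(z)$ for the prior), none of which is defined in the abstract itself:
\begin{equation*}
\mathcal{L}_{\mathrm{ELBO}}(\theta,\phi;x) \;=\; \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log p_\theta(x\mid z)\right] \;-\; D_{\mathrm{KL}}\!\left(q_\phi(z\mid x)\,\|\,p(z)\right).
\end{equation*}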