Learning complex behaviors through interaction requires coordinated long-term planning. Random exploration and novelty search lack task-centric guidance and waste effort on non-informative interactions. Instead, decision making should target samples with the potential to optimize performance far into the future, while only reducing uncertainty where conducive to this objective. This paper presents latent optimistic value exploration (LOVE), a strategy that enables deep exploration through optimism in the face of uncertain long-term rewards. We combine finite-horizon rollouts from a latent model with value function estimates to predict infinite-horizon returns and recover the associated uncertainty through ensembling. Policy training then proceeds on an upper confidence bound (UCB) objective to identify and select the interactions most promising for improving long-term performance. We apply LOVE to visual control tasks in continuous state-action spaces and demonstrate improved sample efficiency on a selection of benchmark tasks.