Effective exploration is critical for reinforcement learning agents in environments with sparse rewards or high-dimensional state-action spaces. Recent works based on state-visitation counts, curiosity, and entropy maximization generate intrinsic reward signals that motivate the agent to visit novel states. However, the agent can be distracted by perturbations to its sensor inputs that contain novel but task-irrelevant information, e.g., due to sensor noise or a changing background. In this work, we introduce the sequential information bottleneck objective for learning compressed and temporally coherent representations by modelling and compressing the sequential predictive information in time-series observations. For efficient exploration in noisy environments, we further construct intrinsic rewards that capture task-relevant state novelty based on the learned representations. We derive a variational upper bound on our sequential information bottleneck objective for practical optimization and provide an information-theoretic interpretation of the derived bound. Our experiments on a set of challenging image-based simulated control tasks show that our method achieves better sample efficiency and greater robustness to both white noise and natural video backgrounds than state-of-the-art methods based on curiosity, entropy maximization, and information gain.