Towards Robust Bisimulation Metric Learning

Learned representations in deep reinforcement learning (DRL) have to extract task-relevant information from complex observations, balancing between robustness to distraction and informativeness to the policy. Such stable and rich representations, often learned via modern function approximation techniques, can enable practical application of the policy improvement theorem, even in high-dimensional continuous state-action spaces. Bisimulation metrics offer one solution to this representation learning problem, by collapsing functionally similar states together in representation space, which promotes invariance to noise and distractors. In this work, we generalize value function approximation bounds for on-policy bisimulation metrics to non-optimal policies and approximate environment dynamics. Our theoretical results help us identify embedding pathologies that may occur in practical use. In particular, we find that these issues stem from an underconstrained dynamics model and an unstable dependence of the embedding norm on the reward signal in environments with sparse rewards. Further, we propose a set of practical remedies: (i) a norm constraint on the representation space, and (ii) an extension of prior approaches with intrinsic rewards and latent space regularization. Finally, we provide evidence that the resulting method is not only more robust to sparse reward functions, but also able to solve challenging continuous control tasks with observational distractions, where prior methods fail.
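
As a rough illustration of the ideas named in the abstract, the following sketch computes a bisimulation-style regression loss between two embedded states and applies a norm constraint to the embeddings. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the loss or the constraint, so the target used here (absolute reward difference plus the discounted distance between next-state embeddings, assuming deterministic latent dynamics) and the rescaling into an L2 ball are illustrative choices, and all function names and parameters (`constrain_norm`, `bisim_target`, `bisim_loss`, `max_norm`, `gamma`) are hypothetical.

```python
import math


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def distance(a, b):
    """Euclidean distance between two embeddings of equal length."""
    return l2_norm([x - y for x, y in zip(a, b)])


def constrain_norm(z, max_norm=1.0):
    """Rescale z so that its L2 norm does not exceed max_norm.

    Assumption: this is one possible form of a norm constraint on the
    representation space; the abstract does not specify the exact form.
    """
    n = l2_norm(z)
    if n <= max_norm:
        return list(z)
    return [x * max_norm / n for x in z]


def bisim_target(r_i, r_j, z_next_i, z_next_j, gamma=0.99):
    """Assumed target distance: reward difference plus the discounted
    distance between next-state embeddings (deterministic latent dynamics)."""
    return abs(r_i - r_j) + gamma * distance(z_next_i, z_next_j)


def bisim_loss(z_i, z_j, r_i, r_j, z_next_i, z_next_j, gamma=0.99):
    """Squared error between the embedding distance and the assumed target."""
    target = bisim_target(r_i, r_j, z_next_i, z_next_j, gamma)
    return (distance(z_i, z_j) - target) ** 2


if __name__ == "__main__":
    # Two hypothetical embedded states, their rewards, and next-state embeddings.
    z_i = constrain_norm([3.0, 4.0])
    z_j = constrain_norm([0.1, 0.2])
    print(bisim_loss(z_i, z_j, 1.0, 0.0, [0.5, 0.0], [0.0, 0.5]))
```

In this sketch, whenever two states receive the same reward, the target reduces to the discounted distance between their next-state embeddings alone, so in a sparse-reward environment most pairs carry no reward signal to anchor the scale of the embedding.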
