Learning with an objective to minimize the mismatch with a reference distribution has been shown to be useful for generative modeling and imitation learning. In this paper, we investigate whether one such objective, the Wasserstein-1 distance between a policy's state visitation distribution and a target distribution, can be utilized effectively for reinforcement learning (RL) tasks. Specifically, this paper focuses on goal-conditioned reinforcement learning where the idealized (unachievable) target distribution has full measure at the goal. This paper introduces a quasimetric specific to Markov Decision Processes (MDPs) and uses this quasimetric to estimate the above Wasserstein-1 distance. It further shows that the policy that minimizes this Wasserstein-1 distance is the policy that reaches the goal in as few steps as possible. Our approach, termed Adversarial Intrinsic Motivation (AIM), estimates this Wasserstein-1 distance through its dual objective and uses it to compute a supplemental reward function. Our experiments show that this reward function changes smoothly with respect to transitions in the MDP and directs the agent's exploration to find the goal efficiently. Additionally, we combine AIM with Hindsight Experience Replay (HER) and show that the resulting algorithm accelerates learning significantly on several simulated robotics tasks when compared to other rewards that encourage exploration or accelerate learning.
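As an illustrative sketch (not stated explicitly in the abstract itself), the Wasserstein-1 distance between the policy's state-visitation distribution $\rho_\pi$ and the goal-centered target distribution $\rho_g$ can be estimated through its Kantorovich-Rubinstein dual, where the Lipschitz constraint is understood with respect to the paper's MDP-specific quasimetric $d$:
$$
W_1(\rho_\pi, \rho_g) \;=\; \sup_{\|f\|_{L} \le 1} \; \mathbb{E}_{s \sim \rho_g}\big[f(s)\big] \;-\; \mathbb{E}_{s \sim \rho_\pi}\big[f(s)\big],
$$
where $\|f\|_{L} \le 1$ denotes that $f$ is 1-Lipschitz under $d$. The learned potential $f$ is then used to shape the supplemental reward that drives the agent toward the goal.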