Dealing with sparse rewards is a long-standing challenge in reinforcement learning (RL). Hindsight Experience Replay (HER) addresses this problem by reusing failed trajectories for one goal as successful trajectories for another. This provides both a minimum density of reward and generalization across multiple goals. However, this strategy is known to result in a biased value function, as the update rule underestimates the likelihood of bad outcomes in a stochastic environment. We propose an asymptotically unbiased importance-sampling-based algorithm to address this problem without sacrificing performance in deterministic environments. We show its effectiveness on a range of robotic systems, including challenging high-dimensional stochastic environments.
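For context, the sketch below illustrates the standard HER relabeling step that the abstract refers to, where states actually reached later in a failed episode are treated as if they had been the intended goals. It is a minimal illustration of vanilla HER (the "future" relabeling strategy), not the bias-corrected algorithm proposed here; the names `Transition`, `her_relabel`, and `reward_fn` are illustrative and not taken from the paper.

```python
import random
from collections import namedtuple

# Illustrative transition record for a goal-conditioned replay buffer.
Transition = namedtuple("Transition", "state action next_state goal reward done")

def her_relabel(episode, reward_fn, k=4):
    """Hindsight relabeling ("future" strategy): for each transition, sample k
    goals from states achieved later in the same episode and recompute the
    reward as if those had been the intended goals all along."""
    relabeled = []
    for t, tr in enumerate(episode):
        relabeled.append(tr)  # keep the original transition
        future = episode[t:]  # states achieved from this step onward
        for _ in range(k):
            new_goal = random.choice(future).next_state      # achieved state reused as goal
            r = reward_fn(tr.next_state, new_goal)           # e.g. 0 if goal reached, -1 otherwise
            relabeled.append(tr._replace(goal=new_goal, reward=r))
    return relabeled
```

Because the substituted goals are chosen conditioned on the outcome that happened to occur, relabeled transitions over-represent favorable outcomes in a stochastic environment, which is the source of the underestimation of bad outcomes noted above; the proposed algorithm corrects for this with importance-sampling weights so that the value estimate becomes asymptotically unbiased.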