Reinforcement learning (RL) algorithms have achieved notable success in recent years, but still struggle with fundamental issues in long-term credit assignment. It remains difficult to learn in situations where success is contingent upon multiple critical steps that are distant in time from each other and from a sparse reward, as is often the case in real life. Moreover, how RL algorithms assign credit in these difficult situations is typically not encoded in a way that can rapidly generalize to new situations. Here, we present an approach using offline contrastive learning, which we call contrastive introspection (ConSpec), that can be added to any existing RL algorithm and addresses both issues. In ConSpec, a contrastive loss is used during offline replay to identify invariances among successful episodes. This takes advantage of the fact that it is easier to retrospectively identify the small set of steps that success is contingent upon than it is to prospectively predict reward at every step taken in the environment. ConSpec stores this knowledge in a collection of prototypes summarizing the intermediate states required for success. During training, arrival at any state that matches these prototypes generates an intrinsic reward that is added to any external rewards. In addition, the reward shaping provided by ConSpec can be made to preserve the optimal policy of the underlying RL agent. The prototypes in ConSpec provide two key benefits for credit assignment: (1) they enable rapid identification of all the critical states, and (2) they do so in a readily interpretable manner, enabling out-of-distribution generalization when sensory features are altered. In summary, ConSpec is a modular system that can be added to any existing RL algorithm to improve its long-term credit assignment.
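To make the mechanism described above concrete, the following is a minimal sketch (not the authors' implementation) of the core ConSpec idea: a small set of prototype vectors is trained with a contrastive objective over encoded states from stored successful versus unsuccessful episodes, and an intrinsic reward is emitted whenever the current state's encoding matches a prototype. All specifics here (the number of prototypes, the margin, the matching threshold, and the function names) are illustrative assumptions.

```python
# Sketch of prototype learning and intrinsic reward in the spirit of ConSpec.
# Assumes state encodings (e.g., from the agent's encoder) are provided as tensors.
import torch
import torch.nn.functional as F

n_prototypes, feat_dim = 8, 64
prototypes = torch.nn.Parameter(torch.randn(n_prototypes, feat_dim))

def match_scores(features: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between encoded states (T x D) and each prototype (H x D)."""
    f = F.normalize(features, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    return f @ p.t()  # (T, H) similarities in [-1, 1]

def contrastive_prototype_loss(success_feats: torch.Tensor,
                               failure_feats: torch.Tensor,
                               margin: float = 0.2) -> torch.Tensor:
    """Push each prototype's best match within successful episodes above its best
    match within failed episodes, so prototypes capture invariances of success."""
    pos = match_scores(success_feats).max(dim=0).values  # best match per prototype (successes)
    neg = match_scores(failure_feats).max(dim=0).values  # best match per prototype (failures)
    return F.relu(margin - (pos - neg)).mean()

def intrinsic_reward(features: torch.Tensor,
                     threshold: float = 0.9,
                     scale: float = 1.0) -> torch.Tensor:
    """Intrinsic reward for arriving at any state that matches a prototype;
    in ConSpec this is added to the external reward during training."""
    sims = match_scores(features)  # (T, H)
    return scale * (sims.max(dim=-1).values > threshold).float()
```

In this sketch the prototypes are updated offline from replayed episodes (minimizing `contrastive_prototype_loss`), while `intrinsic_reward` is queried online and its output added to the environment reward seen by the underlying RL algorithm.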