Temporal Abstraction in Reinforcement Learning with the Successor Representation

Reasoning at multiple levels of temporal abstraction is one of the key attributes of intelligence. In reinforcement learning, this is often modeled through temporally extended courses of action called options. Options allow agents to make predictions and to operate at different levels of abstraction within an environment. Nevertheless, approaches based on the options framework often start with the assumption that a reasonable set of options is known beforehand. When this is not the case, there are no definitive answers for which options one should consider. In this paper, we argue that the successor representation (SR), which encodes states based on the pattern of state visitation that follows them, can be seen as a natural substrate for the discovery and use of temporal abstractions. To support our claim, we take a big-picture view of recent results, showing how the SR can be used to discover options that facilitate either temporally extended exploration or planning. We cast these results as instantiations of a general framework for option discovery in which the agent's representation is used to identify useful options, which are then used to further improve its representation. This results in a virtuous, never-ending cycle in which both the representation and the options are constantly refined based on each other. Beyond option discovery itself, we also discuss how the SR allows us to augment a set of options into a combinatorially large counterpart without additional learning. This is achieved through the combination of previously learned options. Our empirical evaluation focuses on options discovered for exploration and on the use of the SR to combine them. The results of our experiments shed light on important design decisions involved in the definition of options and demonstrate the synergy of different methods based on the SR, such as eigenoptions and the option keyboard.
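To make the idea of encoding states by the visitation pattern that follows them concrete, here is a minimal sketch in Python. It is not the paper's implementation. It assumes a small tabular chain environment, a uniform random-walk policy, and a discount factor of 0.9, and all function names are hypothetical. Under those assumptions the SR has the closed form Psi = (I - gamma P)^{-1}, where P is the policy's state-transition matrix. The last two functions illustrate one common way an eigenvector of the SR can define an intrinsic reward for an exploratory option, in the spirit of eigenoptions; the abstract itself does not specify these details.

```python
import numpy as np


def random_walk_transitions(n_states):
    """Transition matrix of a uniform random walk on a chain (assumed toy environment).

    Moving into a wall leaves the agent in place, so each row sums to 1.
    """
    P = np.zeros((n_states, n_states))
    for s in range(n_states):
        left = max(s - 1, 0)
        right = min(s + 1, n_states - 1)
        P[s, left] += 0.5
        P[s, right] += 0.5
    return P


def successor_representation(P, gamma):
    """Closed-form tabular SR: Psi = sum_t gamma^t P^t = (I - gamma P)^{-1}.

    Psi[s, s2] is the expected discounted number of visits to s2 when starting in s.
    """
    n = P.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * P)


def second_eigenvector(Psi):
    """Eigenvector of a symmetric SR with the second-largest eigenvalue.

    For this toy chain the top eigenvector is constant and carries no
    directional information, so the next one is used instead.
    """
    _, eigenvectors = np.linalg.eigh(Psi)  # eigenvalues in ascending order
    return eigenvectors[:, -2]


def intrinsic_reward(e, s, s_next):
    """Illustrative intrinsic reward: change of the eigenvector along a transition."""
    return e[s_next] - e[s]


P = random_walk_transitions(6)
Psi = successor_representation(P, gamma=0.9)
e = second_eigenvector(Psi)
print(np.round(Psi, 2))
print(np.round(e, 2))
```

In this chain the selected eigenvector varies monotonically from one end to the other, so an option that maximizes the intrinsic reward above travels toward one end of the chain, which is the kind of temporally extended exploratory behavior the abstract refers to. The sign of the eigenvector is arbitrary, so negating it yields an option heading to the opposite end.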
