We present a temporally extended variation of the successor representation, which we term t-SR. t-SR captures the expected state transition dynamics of temporally extended actions by constructing successor representations over primitive action repeats. This form of temporal abstraction does not learn a top-down hierarchy of pertinent task structures, but rather a bottom-up composition of coupled actions and action repetitions. This reduces the number of decisions required for control without learning a hierarchical policy. As such, t-SR directly considers the time horizon of temporally extended action sequences without the need for predefined or domain-specific options. We show that in environments with dynamic reward structure, t-SR is able to leverage both the flexibility of the successor representation and the abstraction afforded by temporally extended actions. Thus, in a series of sparsely rewarded gridworld environments, t-SR adapts its learnt policies to optimality far faster than comparable value-based, model-free reinforcement learning methods. We also show that, in solving these tasks, t-SR consistently samples its learnt policy less often than agents using non-temporally extended policies.
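To make the core idea above concrete, the following is a minimal tabular sketch of a successor representation (SR) learned over primitive-action repeats, i.e. over (action, repeat count) pairs. The environment interface (step(a) returning (next state, reward, done)), the repeat set, and all hyperparameter values are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

class TSRSketch:
    """Tabular sketch: SRs indexed by (primitive action, repeat count) pairs."""

    def __init__(self, n_states, n_actions, repeats=(1, 2, 4),
                 gamma=0.95, alpha=0.1, epsilon=0.1):
        self.repeats = repeats
        self.n_options = n_actions * len(repeats)          # (action, repeat) pairs
        self.M = np.zeros((n_states, self.n_options, n_states))  # one SR per option
        self.w = np.zeros(n_states)                        # learned reward weights
        self.gamma, self.alpha, self.eps = gamma, alpha, epsilon

    def act(self, s):
        # Epsilon-greedy over Q(s, o) = M(s, o) . w, as in SR-based control.
        if np.random.rand() < self.eps:
            return np.random.randint(self.n_options)
        return int(np.argmax(self.M[s] @ self.w))

    def step_option(self, env, s, option):
        # Execute primitive action `a` for `k` consecutive steps, accumulating
        # the discounted occupancy of the states visited along the way.
        a, idx = divmod(option, len(self.repeats))
        k = self.repeats[idx]
        occupancy = np.zeros(self.M.shape[2])
        discount, s_cur, done = 1.0, s, False
        for _ in range(k):
            occupancy[s_cur] += discount
            s_next, r, done = env.step(a)                  # assumed interface
            self.w[s_next] += self.alpha * (r - self.w[s_next])  # reward model
            discount *= self.gamma
            s_cur = s_next
            if done:
                break
        return s_cur, occupancy, discount, done

    def update(self, s, option, s_next, occupancy, discount, done):
        # SR TD update over the whole temporally extended transition,
        # bootstrapping with the option-level discount gamma^k.
        best = int(np.argmax(self.M[s_next] @ self.w))
        target = occupancy + (0.0 if done else discount) * self.M[s_next, best]
        self.M[s, option] += self.alpha * (target - self.M[s, option])
```

A training loop would simply alternate act, step_option, and update until the episode terminates; because each option spans k primitive steps, the policy is queried less often than a one-step policy acting over the same trajectory.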