We present temporally abstract actor-critic (TAAC), a simple but effective off-policy RL algorithm that incorporates closed-loop temporal abstraction into the actor-critic framework. TAAC adds a second-stage binary policy to choose between the previous action and a new action output by an actor. Crucially, its "act-or-repeat" decision hinges on the actually sampled action instead of the expected behavior of the actor. This post-acting switching scheme lets the overall policy make more informed decisions. TAAC has two important features: a) persistent exploration, and b) a new compare-through Q operator for multi-step TD backup, specially tailored to the action-repetition scenario. We demonstrate TAAC's advantages over several strong baselines across 14 continuous control tasks. Surprisingly, we find that while achieving top performance, TAAC's trained policy still "mines" a significant number of repeated actions, even on continuous tasks whose problem structures on the surface seem to repel action repetition. This suggests that beyond encouraging persistent exploration, action repetition can be part of good policy behavior. Code is available at https://github.com/hnyu/taac.
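To make the two-stage decision concrete, below is a minimal, hypothetical Python sketch of the act-or-repeat step described above; the names `actor`, `switch_policy`, and `taac_act` are illustrative placeholders and do not correspond to the released implementation.

```python
import torch

# Illustrative sketch (not the authors' code) of TAAC's two-stage action selection.
def taac_act(obs, prev_action, actor, switch_policy):
    """obs: [B, obs_dim]; prev_action: [B, act_dim]."""
    # Stage 1: the actor proposes a new candidate action.
    candidate = actor(obs).sample()                      # [B, act_dim]

    # Stage 2: the binary policy conditions on the actually sampled candidate
    # (not just the actor's expected behavior) and decides act-or-repeat.
    logits = switch_policy(obs, prev_action, candidate)  # [B, 1]
    repeat = torch.distributions.Bernoulli(logits=logits).sample()  # [B, 1]

    # repeat == 1 -> keep the previous action; repeat == 0 -> take the candidate.
    return repeat * prev_action + (1.0 - repeat) * candidate
```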