We propose learning via retracing, a novel self-supervised approach to learning the state representation (and the associated dynamics model) for reinforcement learning tasks. In addition to the predictive (reconstruction) supervision in the forward direction, we include "retraced" transitions in representation/model learning, enforcing a cycle-consistency constraint between the original and retraced states and thereby improving the sample efficiency of learning. Moreover, learning via retracing explicitly propagates information about future transitions backward for inferring previous states, thus facilitating stronger representation learning for downstream reinforcement learning tasks. We introduce Cycle-Consistency World Model (CCWM), a concrete model-based instantiation of learning via retracing. Additionally, we propose a novel adaptive "truncation" mechanism to counteract the negative impact of "irreversible" transitions, so that learning via retracing can be maximally effective. Through extensive empirical studies on vision-based continuous control benchmarks, we demonstrate that CCWM achieves state-of-the-art sample efficiency and asymptotic performance, whilst exhibiting behaviours indicative of stronger representation learning.
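To make the cycle-consistency idea concrete, the following is a minimal PyTorch sketch of a forward prediction plus retracing objective. All module names, architectures, and shapes here (LatentDynamics, cycle_consistency_loss, the MLP encoder and dynamics heads) are illustrative assumptions, not the authors' CCWM implementation, which additionally involves reconstruction supervision and the adaptive truncation mechanism described above.

```python
# Illustrative sketch only: a deterministic latent forward/backward model
# with a cycle-consistency penalty between original and retraced states.
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Toy latent world model with forward and backward (retracing) heads."""
    def __init__(self, obs_dim: int, act_dim: int, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        # forward model: (z_t, a_t) -> z_{t+1}
        self.forward_model = nn.Sequential(nn.Linear(latent_dim + act_dim, 64),
                                           nn.ReLU(), nn.Linear(64, latent_dim))
        # backward ("retracing") model: (z_{t+1}, a_t) -> z_t
        self.backward_model = nn.Sequential(nn.Linear(latent_dim + act_dim, 64),
                                            nn.ReLU(), nn.Linear(64, latent_dim))

def cycle_consistency_loss(model: LatentDynamics,
                           obs_t: torch.Tensor,
                           act_t: torch.Tensor,
                           obs_tp1: torch.Tensor) -> torch.Tensor:
    """Forward-predict z_{t+1}, retrace back to time t, and penalise the
    discrepancy between the original and retraced latent states."""
    z_t = model.encoder(obs_t)
    z_tp1 = model.encoder(obs_tp1)
    # forward (predictive) supervision
    z_tp1_pred = model.forward_model(torch.cat([z_t, act_t], dim=-1))
    forward_loss = ((z_tp1_pred - z_tp1.detach()) ** 2).mean()
    # retrace from the predicted next state back to the original state
    z_t_retraced = model.backward_model(torch.cat([z_tp1_pred, act_t], dim=-1))
    cycle_loss = ((z_t_retraced - z_t.detach()) ** 2).mean()
    return forward_loss + cycle_loss
```

Under this sketch, the backward pass supplies the extra self-supervision: each observed transition is used twice, once forward and once retraced, which is the mechanism by which retracing improves sample efficiency.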