We propose learning via retracing, a novel self-supervised approach for learning the state representation (and the associated dynamics model) for reinforcement learning tasks. In addition to the predictive (reconstruction) supervision in the forward direction, we propose to include "retraced" transitions for representation/model learning by enforcing a cycle-consistency constraint between the original and retraced states, thereby improving the sample efficiency of learning. Moreover, learning via retracing explicitly propagates information about future transitions backward for inferring previous states, thus facilitating stronger representation learning. We introduce the Cycle-Consistency World Model (CCWM), a concrete instantiation of learning via retracing implemented within an existing model-based reinforcement learning framework. Additionally, we propose a novel adaptive "truncation" mechanism to counteract the negative impact of "irreversible" transitions, so that learning via retracing can be maximally effective. Through extensive empirical studies on continuous control benchmarks, we demonstrate that CCWM achieves state-of-the-art performance in terms of sample efficiency and asymptotic performance.
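To make the cycle-consistency constraint concrete, the following is a minimal sketch in notation of our own choosing (the encoder $\phi$, forward model $f_\theta$, backward model $b_\psi$, and distance $d$ are illustrative assumptions, not definitions taken from the paper): with latent states $z_t = \phi(s_t)$, a forward model predicting $\hat{z}_{t+1} = f_\theta(z_t, a_t)$, and a backward (retracing) model producing $\tilde{z}_t = b_\psi(\hat{z}_{t+1}, a_t)$, a cycle-consistency loss of the form

\[
\mathcal{L}_{\mathrm{cyc}} \;=\; \mathbb{E}_{(s_t,\, a_t)}\!\left[\, d\!\left( z_t,\; b_\psi\big(f_\theta(z_t, a_t),\, a_t\big) \right) \right]
\]

penalizes disagreement between the original latent state and its retraced counterpart, and would be added alongside the standard forward predictive (reconstruction) objective. The adaptive truncation mechanism would then down-weight or exclude this term for transitions judged irreversible.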