We propose to learn to distinguish reversible from irreversible actions for better-informed decision-making in Reinforcement Learning (RL). From theoretical considerations, we show that approximate reversibility can be learned through a simple surrogate task: ranking randomly sampled trajectory events in chronological order. Intuitively, pairs of events that are always observed in the same order are likely to be separated by an irreversible sequence of actions. Conveniently, learning the temporal order of events can be done in a fully self-supervised way, which we use to estimate the reversibility of actions from experience, without any priors. We propose two strategies that incorporate reversibility into RL agents, one for exploration (RAE) and one for control (RAC). We demonstrate the potential of reversibility-aware agents in several environments, including the challenging Sokoban game. In synthetic tasks, we show that we can learn control policies that never fail and reduce the side effects of interactions to zero, even without access to the reward function.
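To make the surrogate task concrete, below is a minimal sketch (not the paper's implementation) of learning temporal order from randomly sampled trajectory pairs, assuming vector observations and a PyTorch setup; the names `OrderNet`, `sample_pair`, and `train_step` are illustrative placeholders.

```python
# Minimal sketch of the self-supervised temporal-order task: sample two events
# from a trajectory, shuffle their presentation order, and train a classifier
# to recover the true chronological order. Assumes observations are 1-D tensors.
import random
import torch
import torch.nn as nn

class OrderNet(nn.Module):
    """Predicts the probability that observation `a` precedes observation `b`."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, a, b):
        return torch.sigmoid(self.net(torch.cat([a, b], dim=-1))).squeeze(-1)

def sample_pair(trajectory):
    """Sample two distinct time indices and return (earlier, later) observations."""
    i, j = sorted(random.sample(range(len(trajectory)), 2))
    return trajectory[i], trajectory[j]

def train_step(model, optimizer, trajectories, batch_size=32):
    """One self-supervised step: present each pair in a random order, predict it."""
    xs, ys, labels = [], [], []
    for _ in range(batch_size):
        early, late = sample_pair(random.choice(trajectories))
        if random.random() < 0.5:          # chronological presentation
            xs.append(early); ys.append(late); labels.append(1.0)
        else:                               # reversed presentation
            xs.append(late); ys.append(early); labels.append(0.0)
    preds = model(torch.stack(xs), torch.stack(ys))
    loss = nn.functional.binary_cross_entropy(preds, torch.tensor(labels))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# At evaluation time, model(s, s_next) close to 1 means s is almost always seen
# before s_next, suggesting the transition s -> s_next is hard to reverse.
```

In this sketch, the classifier's output on a state pair serves as a proxy for reversibility: transitions whose order is predicted with high confidence are treated as likely irreversible, which is the signal an exploration (RAE) or control (RAC) strategy could then penalize or avoid.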