Identifying controllable aspects of the environment has proven to be an extraordinary intrinsic motivator for reinforcement learning agents. Despite repeatedly achieving state-of-the-art results, this approach has only been studied as a proxy for a reward-based task and has not yet been evaluated on its own. We show that solutions relying on action prediction fail to model critical controlled events. Humans, on the other hand, assign blame to their actions to decide what they controlled. This work proposes the Controlled Effect Network (CEN), an unsupervised method based on counterfactual measures of blame that identifies the effects on the environment controlled by the agent. We evaluate CEN in a wide range of environments and show that it accurately identifies controlled effects. Moreover, we demonstrate CEN's capabilities as an intrinsic motivator by integrating it into a state-of-the-art exploration method, achieving substantially better performance than action-prediction models.
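To make the idea of counterfactual blame concrete, the following is a minimal sketch on a toy environment, not the CEN architecture itself. It assumes a known, deterministic transition function so that counterfactual actions can be replayed exactly, whereas CEN is described as an unsupervised, learned method. The environment (an agent and an independently drifting ball on a line) and the names `step`, `effect`, and `controlled_mask` are hypothetical and chosen only for illustration. A component of the observed effect is attributed to the agent if at least one alternative action would have produced a different value for that component.

```python
from typing import Sequence, Tuple

State = Tuple[int, int]  # (agent_pos, ball_pos); a hypothetical toy state
ACTIONS = ("left", "right", "noop")


def step(state: State, action: str) -> State:
    """Toy deterministic dynamics (an assumption made for this sketch)."""
    agent, ball = state
    if action == "left":
        agent -= 1
    elif action == "right":
        agent += 1
    ball += 1  # the ball drifts regardless of what the agent does
    return (agent, ball)


def effect(state: State, action: str) -> Tuple[int, ...]:
    """Effect of an action, defined here as the per-component state change."""
    nxt = step(state, action)
    return tuple(n - s for n, s in zip(nxt, state))


def controlled_mask(
    state: State, action: str, actions: Sequence[str] = ACTIONS
) -> Tuple[bool, ...]:
    """Blame the action for a component of the effect if some counterfactual
    action would have changed that component differently."""
    actual = effect(state, action)
    alternatives = [effect(state, a) for a in actions if a != action]
    return tuple(
        any(alt[i] != actual[i] for alt in alternatives)
        for i in range(len(actual))
    )


if __name__ == "__main__":
    # The agent's own position is controlled; the ball's drift is not.
    print(controlled_mask((0, 0), "right"))
```

In this toy setting the agent's displacement is flagged as controlled, while the ball's motion, which is identical under every action, is not. An action-prediction model has no comparable notion of "what would have happened otherwise", which is the contrast the abstract draws.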