Modeling the controllable aspects of the environment enables better prioritization of interventions and has become a popular exploration strategy in reinforcement learning. Despite repeatedly achieving state-of-the-art results, this approach has only been studied as a proxy for a reward-based task and has not yet been evaluated on its own. We show that solutions relying on action prediction fail to model important events. Humans, on the other hand, assign blame to their actions to decide what they controlled. Here we propose the Controlled Effect Network (CEN), an unsupervised method based on counterfactual measures of blame. We evaluate CEN in a wide range of environments and show that it identifies controlled effects better than popular models based on action prediction.