Identifying controllable aspects of the environment has proven to be an extraordinary intrinsic motivator for reinforcement learning agents. Despite repeatedly achieving state-of-the-art results, this approach has only been studied as a proxy for a reward-based task and has not yet been evaluated on its own. Current methods are based on action prediction. Humans, on the other hand, assign blame to their actions to decide what they controlled. This work proposes the Controlled Effect Network (CEN), an unsupervised method based on counterfactual measures of blame that identifies the effects on the environment controlled by the agent. CEN is evaluated in a wide range of environments, showing that it can accurately identify controlled effects. Moreover, we demonstrate CEN's capabilities as an intrinsic motivator by integrating it into the state-of-the-art exploration method, achieving substantially better performance than action-prediction models.
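As an informal illustration of what a counterfactual notion of blame can mean here, the sketch below compares the effect of the action actually taken with the effects the other available actions would have had from the same state. It is not the CEN model, which is learned from experience. The toy dynamics, the action set `ACTIONS`, and the helpers `step`, `effect`, and `controlled_effect` are all hypothetical and assume a known, deterministic environment.

```python
# Toy illustration only: a hand-written deterministic world, not the CEN architecture.
ACTIONS = (-1, 0, 1)  # hypothetical actions: move left, stay, move right


def step(state, action):
    """Assumed dynamics: the action moves the agent; the object drifts by +1 regardless."""
    agent, obj = state
    return (agent + action, obj + 1)


def effect(state, action):
    """Total effect of an action: the per-component change it produces in the state."""
    nxt = step(state, action)
    return tuple(n - s for n, s in zip(nxt, state))


def controlled_effect(state, action):
    """Keep only the components whose change would differ under some other action.

    A component that changes identically under every action (the drifting object)
    is not blamed on the agent and is zeroed out.
    """
    taken = effect(state, action)
    alternatives = [effect(state, a) for a in ACTIONS if a != action]
    return tuple(
        e if any(alt[i] != e for alt in alternatives) else 0
        for i, e in enumerate(taken)
    )


print(effect((3, 7), 1))             # (1, 1): both components changed
print(controlled_effect((3, 7), 1))  # (1, 0): only the agent's move is attributed to the action
```

In this toy world an action-independent change is never attributed to the agent, which is the intuition behind separating controlled effects from everything else that happens in the environment.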