Can humans get arbitrarily capable reinforcement learning (RL) agents to do their bidding? Or will sufficiently capable RL agents always find ways to bypass their intended objectives by shortcutting their reward signal? This question bears on how far RL can be scaled, and on whether alternative paradigms must be developed in order to build safe artificial general intelligence. In this paper, we study when an RL agent has an instrumental goal to tamper with its reward process, and describe design principles that prevent such instrumental goals for two different types of reward tampering (reward function tampering and RF-input tampering). Combined, the design principles can prevent both types of reward tampering from being instrumental goals. The analysis relies on causal influence diagrams, which provide intuitive yet precise formalizations.