We provide the first formal definition of reward hacking, a phenomenon where optimizing an imperfect proxy reward function, $\mathcal{\tilde{R}}$, leads to poor performance according to the true reward function, $\mathcal{R}$. We say that a proxy is unhackable if increasing the expected proxy return can never decrease the expected true return. Intuitively, it might be possible to create an unhackable proxy by leaving some terms out of the reward function (making it "narrower") or overlooking fine-grained distinctions between roughly equivalent outcomes, but we show this is usually not the case. A key insight is that the linearity of reward (in state-action visit counts) makes unhackability a very strong condition. In particular, for the set of all stochastic policies, two reward functions can only be unhackable if one of them is constant. We thus turn our attention to deterministic policies and finite sets of stochastic policies, where non-trivial unhackable pairs always exist, and establish necessary and sufficient conditions for the existence of simplifications, an important special case of unhackability. Our results reveal a tension between using reward functions to specify narrow tasks and aligning AI systems with human values.
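As a sketch, the unhackability condition described above can be written as follows, where $J_{\mathcal{R}}(\pi)$ denotes the expected return of policy $\pi$ under reward function $\mathcal{R}$ and $\Pi$ is the policy set under consideration (the symbol $J$ and the quantified form are our own notational assumptions, not necessarily those used in the paper):
$$
\forall\, \pi, \pi' \in \Pi:\quad J_{\mathcal{\tilde{R}}}(\pi') > J_{\mathcal{\tilde{R}}}(\pi) \;\Longrightarrow\; J_{\mathcal{R}}(\pi') \geq J_{\mathcal{R}}(\pi).
$$
That is, moving to a policy with strictly higher expected proxy return can never strictly decrease the expected true return.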