Our ability to know when to trust the decisions made by machine learning systems has not kept up with the staggering improvements in their performance, limiting their applicability in high-stakes domains. We introduce Prover-Verifier Games (PVGs), a game-theoretic framework to encourage learning agents to solve decision problems in a verifiable manner. The PVG consists of two learners with competing objectives: a trusted verifier network tries to choose the correct answer, and a more powerful but untrusted prover network attempts to persuade the verifier of a particular answer, regardless of its correctness. The goal is for a reliable justification protocol to emerge from this game. We analyze variants of the framework, including simultaneous and sequential games, and narrow the space down to a subset of games which provably have the desired equilibria. We develop instantiations of the PVG for two algorithmic tasks, and show that in practice, the verifier learns a robust decision rule that is able to receive useful and reliable information from an untrusted prover. Importantly, the protocol still works even when the verifier is frozen and the prover's messages are directly optimized to convince the verifier.
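To make the competing objectives concrete, below is a minimal, hypothetical sketch of one alternating PVG update in a toy binary-answer setting. All module and variable names are illustrative assumptions, not the paper's implementation: the prover is assigned an arbitrary answer to argue for and is trained to convince the verifier of it, while the verifier is trained to output the true answer given the prover's message.

```python
# Hypothetical sketch of one Prover-Verifier Game training step on toy data.
# Assumes fixed-size problem encodings and binary answers; names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

PROBLEM_DIM, MESSAGE_DIM, HIDDEN = 16, 8, 64

# Untrusted prover: maps (problem, claimed answer) to a message for the verifier.
prover = nn.Sequential(
    nn.Linear(PROBLEM_DIM + 1, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, MESSAGE_DIM)
)
# Trusted verifier: maps (problem, message) to a logit for answer "1".
verifier = nn.Sequential(
    nn.Linear(PROBLEM_DIM + MESSAGE_DIM, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, 1)
)
opt_p = torch.optim.Adam(prover.parameters(), lr=1e-3)
opt_v = torch.optim.Adam(verifier.parameters(), lr=1e-3)

def pvg_step(problems, true_answers):
    """One alternating update: the prover argues for a (possibly wrong) claimed
    answer; the verifier is trained to recover the true answer regardless."""
    batch = problems.shape[0]
    # The prover is assigned an arbitrary answer to argue for, not the true one.
    claimed = torch.randint(0, 2, (batch, 1)).float()

    # Prover update: maximize the verifier's belief in the claimed answer.
    message = prover(torch.cat([problems, claimed], dim=1))
    logits = verifier(torch.cat([problems, message], dim=1))
    prover_loss = F.binary_cross_entropy_with_logits(logits, claimed)
    opt_p.zero_grad()
    prover_loss.backward()
    opt_p.step()

    # Verifier update: predict the true answer given the (detached) message.
    message = prover(torch.cat([problems, claimed], dim=1)).detach()
    logits = verifier(torch.cat([problems, message], dim=1))
    verifier_loss = F.binary_cross_entropy_with_logits(logits, true_answers)
    opt_v.zero_grad()
    verifier_loss.backward()
    opt_v.step()
    return prover_loss.item(), verifier_loss.item()

# Example usage on random toy data.
problems = torch.randn(32, PROBLEM_DIM)
true_answers = torch.randint(0, 2, (32, 1)).float()
print(pvg_step(problems, true_answers))
```

This sketch corresponds to a simultaneous-style alternating update; the paper's sequential variants and equilibrium analysis impose additional structure on who moves first and what the prover is rewarded for.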