For some problems, humans may not be able to accurately judge the goodness of AI-proposed solutions. Irving et al. (2018) propose that in such cases, we may use a debate between two AI systems to amplify the problem-solving capabilities of a human judge. We introduce a mathematical framework that can model debates of this type and propose that the quality of debate designs should be measured by the accuracy of the most persuasive answer. We describe a simple instance of the debate framework called feature debate and analyze the degree to which such debates track the truth. We argue that despite being very simple, feature debates nonetheless capture many aspects of practical debates such as the incentives to confuse the judge or stall to prevent losing. We then outline how these models should be generalized to analyze a wider range of debate phenomena.