Estimating the performance of a machine learning system is a longstanding challenge in artificial intelligence research. Today, this challenge is especially relevant given the emergence of systems that appear to increasingly outperform human beings. In some cases, this "superhuman" performance is readily demonstrated, for example, by defeating legendary human players in traditional two-player games. On the other hand, it can be challenging to evaluate classification models that potentially surpass human performance. Indeed, human annotations are often treated as ground truth, which implicitly assumes that humans are superior to any model trained on their annotations. In reality, human annotators can make mistakes and be subjective. Evaluating performance with respect to a genuine oracle may be more objective and reliable, even when querying the oracle is expensive or impossible. In this paper, we first raise the challenge of evaluating the performance of both humans and models with respect to an unobserved oracle. We develop a theory for estimating accuracy relative to the oracle, using only imperfect human annotations for reference. Our analysis provides a simple recipe for detecting and certifying superhuman performance in this setting, which we believe will assist in understanding the current state of research on classification. We validate the convergence of the bounds and the assumptions of our theory on carefully designed toy experiments with known oracles. Moreover, we demonstrate the utility of our theory by meta-analyzing large-scale natural language processing tasks, for which an oracle does not exist, and show that, under our assumptions, a number of models from recent years are superhuman with high probability.