The Measure of Intelligence

To make deliberate progress towards more intelligent and more human-like artificial systems, we need to be following an appropriate feedback signal: we need to be able to define and evaluate intelligence in a way that enables comparisons between two systems, as well as comparisons with humans. Over the past hundred years, there has been an abundance of attempts to define and measure intelligence, across both the fields of psychology and AI. We summarize and critically assess these definitions and evaluation approaches, while making apparent the two historical conceptions of intelligence that have implicitly guided them. We note that in practice, the contemporary AI community still gravitates towards benchmarking intelligence by comparing the skill exhibited by AIs and humans at specific tasks such as board games and video games. We argue that solely measuring skill at any given task falls short of measuring intelligence, because skill is heavily modulated by prior knowledge and experience: unlimited priors or unlimited training data allow experimenters to "buy" arbitrary levels of skills for a system, in a way that masks the system's own generalization power. We then articulate a new formal definition of intelligence based on Algorithmic Information Theory, describing intelligence as skill-acquisition efficiency and highlighting the concepts of scope, generalization difficulty, priors, and experience. Using this definition, we propose a set of guidelines for what a general AI benchmark should look like. Finally, we present a benchmark closely following these guidelines, the Abstraction and Reasoning Corpus (ARC), built upon an explicit set of priors designed to be as close as possible to innate human priors. We argue that ARC can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.
