To build increasingly general-purpose artificial intelligence systems that can deal with unknown variables across unknown domains, we need benchmarks that measure precisely how well these systems perform on tasks they have never seen before. A prerequisite for this is a measure of a task's generalization difficulty: how dissimilar it is from the system's prior knowledge and experience. If the skill of an intelligence system in a particular domain is defined as its ability to consistently generate a set of instructions (or programs) that solve tasks in that domain, then current benchmarks do not quantitatively measure the efficiency of acquiring new skills, making it possible to brute-force skill acquisition by training with unlimited amounts of data and compute. With this in mind, we first propose a common language of instruction: a programming language that allows programs to be expressed as directed acyclic graphs across a wide variety of real-world domains and computing platforms. Using programs generated in this language, we demonstrate a match-based method to both score performance and calculate the generalization difficulty of any given set of tasks. We use these to define a numeric benchmark, the g-index, that measures and compares the skill-acquisition efficiency of any intelligence system on a set of real-world tasks. Finally, we evaluate the suitability of some well-known models as general intelligence systems by calculating their g-index scores.
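To make the two central ideas concrete, the following is a minimal sketch, not the paper's actual language or scoring rule: it represents a toy program as a directed acyclic graph of named operations, executes it in topological order, and computes a naive set-overlap "match" score between a generated instruction set and a reference one. All node names, the execution model, and the Jaccard-style metric are illustrative assumptions.

```python
# Toy DAG program + naive match-based scoring (illustrative only).
# Assumptions: nodes are pure functions of their dependencies' outputs,
# and "matching" is set overlap of instruction names; the paper's real
# language and scoring method are not reproduced here.
from graphlib import TopologicalSorter

def run_dag(nodes, deps, inputs):
    """Execute a DAG program: each node is a function of its dependencies."""
    order = TopologicalSorter(deps).static_order()  # predecessors first
    values = dict(inputs)
    for name in order:
        if name not in values:  # leaf inputs are provided, not computed
            values[name] = nodes[name](*(values[d] for d in deps[name]))
    return values

# A tiny program computing out = (a + b) * b as a two-node DAG.
nodes = {"sum": lambda a, b: a + b,
         "out": lambda s, b: s * b}
deps = {"a": [], "b": [], "sum": ["a", "b"], "out": ["sum", "b"]}
result = run_dag(nodes, deps, {"a": 2, "b": 3})  # result["out"] == 15

def match_score(generated, reference):
    """Naive match score: overlap between two instruction sets (0..1)."""
    g, r = set(generated), set(reference)
    return len(g & r) / len(g | r) if g | r else 1.0
```

In this sketch, a lower overlap between a task's required instructions and those seen during training would indicate higher generalization difficulty; the g-index would then weight raw performance by that difficulty and by the data and compute spent acquiring the skill.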