To build general-purpose artificial intelligence systems that can handle unknown variables across unknown domains, we need benchmarks that measure how well these systems perform on tasks they have never seen before. A prerequisite for this is a measure of a task's generalization difficulty, that is, how dissimilar the task is from the system's prior knowledge and experience. If we define the skill of an intelligence system in a particular domain as its ability to consistently generate a set of instructions (or programs) that solve tasks in that domain, then current benchmarks do not quantitatively measure the efficiency of acquiring new skills. This makes it possible to brute-force skill acquisition by training with unlimited amounts of data and compute power. With this in mind, we first propose a common language of instruction, a programming language that allows programs to be expressed as directed acyclic graphs across a wide variety of real-world domains and computing platforms. Using programs generated in this language, we demonstrate a match-based method to both score performance and calculate the generalization difficulty of any given set of tasks. We use these two measures to define a numeric benchmark called the generalization index, or g-index, which measures and compares the skill-acquisition efficiency of any intelligence system on a set of real-world tasks. Finally, we evaluate the suitability of some well-known models as general intelligence systems by calculating their g-index scores.