AI has achieved remarkable mastery over games such as Chess, Go, Poker, and even Jeopardy, but the rich variety of standardized exams has remained a landmark challenge. Even in 2016, the best AI system achieved merely 59.3% on an 8th Grade science exam challenge. This paper reports unprecedented success on the Grade 8 New York Regents Science Exam, where for the first time a system scores more than 90% on the exam's non-diagram, multiple choice (NDMC) questions. In addition, our Aristo system, building upon the success of recent language models, exceeds 83% on the corresponding Grade 12 Science Exam NDMC questions. The results, on unseen test questions, are robust across different test years and different variations of this kind of test. They demonstrate that modern NLP methods can result in mastery on this task. While not a full solution to general question answering (the questions are multiple choice, and the domain is restricted to 8th Grade science), this result represents a significant milestone for the field.