Can a machine learn machine learning? We propose to answer this question using the same criteria we use to answer a similar question: can a human learn machine learning? We automatically answer final exam questions from MIT's, Harvard's, and Cornell's large machine learning courses and generate new questions at a human level. Recently, program synthesis and few-shot learning solved university-level problem set questions in mathematics and STEM courses at a human level. In this work, we solve questions from final exams, which differ from problem sets in several ways: the questions are longer, have multiple parts, are more complicated, and span a broader set of topics. We provide a new dataset and benchmark of questions from machine learning final exams, together with code for automatically answering these questions and generating new ones. To make our dataset a reproducible benchmark, we use automatic checkers for multiple-choice questions, questions with numeric answers, and questions with expression answers, and we evaluate a large open language model, Meta's OPT, comparing the results with OpenAI's GPT-3, ChatGPT, and Codex. A student survey comparing the quality, appropriateness, and difficulty of machine-generated questions with human-written questions shows that, across multiple aspects, machine-generated questions are indistinguishable from human-written questions and are suitable for final exams. We perform ablation studies comparing zero-shot learning with few-shot learning and chain-of-thought prompting, and comparing GPT-3, ChatGPT, and OPT (pre-trained on text) with Codex (fine-tuned on code), across a range of machine learning topics, and find that few-shot learning methods perform best. We make our data and code publicly available for the machine learning community.
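To illustrate the kind of automatic checkers such a reproducible benchmark relies on, here is a minimal sketch in Python. It assumes sympy is available for symbolic comparison; the function names and tolerances are illustrative, not the paper's actual API.

```python
# Minimal sketch of automatic answer checkers for the three question types.
# Function names and tolerances are illustrative assumptions.
import math
import sympy

def check_multiple_choice(predicted: str, correct: str) -> bool:
    """Exact match on the normalized choice label (e.g. 'a', 'b', 'c')."""
    return predicted.strip().lower() == correct.strip().lower()

def check_numeric(predicted: float, correct: float, rel_tol: float = 1e-3) -> bool:
    """Numeric answers match up to a small relative tolerance."""
    return math.isclose(predicted, correct, rel_tol=rel_tol)

def check_expression(predicted: str, correct: str) -> bool:
    """Expressions match if their symbolic difference simplifies to zero."""
    try:
        diff = sympy.simplify(sympy.sympify(predicted) - sympy.sympify(correct))
        return diff == 0
    except (sympy.SympifyError, TypeError):
        return False
```

Symbolic comparison lets the checker accept algebraically equivalent answers (e.g. `2*x + x` versus `3*x`) rather than requiring an exact string match.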
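The few-shot chain-of-thought setup from the ablation studies can be sketched similarly. The snippet below assumes the legacy completions endpoint of the pre-1.0 `openai` Python library; the worked examples in the prompt and the model choice are placeholders, not the exact prompts used in this work.

```python
# Minimal sketch of few-shot chain-of-thought prompting, assuming the
# legacy (pre-1.0) openai library. Prompt contents are placeholders.
import openai

FEW_SHOT_PROMPT = """\
Q: <worked example question 1>
A: Let's think step by step. <worked reasoning> The answer is <answer 1>.

Q: <worked example question 2>
A: Let's think step by step. <worked reasoning> The answer is <answer 2>.

Q: {question}
A: Let's think step by step."""

def answer_exam_question(question: str) -> str:
    # Deterministic decoding (temperature=0) keeps graded runs reproducible.
    response = openai.Completion.create(
        model="code-davinci-002",  # Codex; a GPT-3 text model could be swapped in
        prompt=FEW_SHOT_PROMPT.format(question=question),
        max_tokens=512,
        temperature=0.0,
    )
    return response["choices"][0]["text"].strip()
```

Prepending worked examples with explicit step-by-step reasoning is what distinguishes few-shot chain-of-thought prompting from the zero-shot baseline, which would send the exam question alone.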