We investigate the mathematical capabilities of ChatGPT by testing it on publicly available datasets, as well as hand-crafted ones, and measuring its performance against other models trained on a mathematical corpus, such as Minerva. We also test whether ChatGPT can be a useful assistant to professional mathematicians by emulating various use cases that come up in the daily professional activities of mathematicians (question answering, theorem searching). In contrast to formal mathematics, where large databases of formal proofs are available (e.g., the Lean Mathematical Library), current datasets of natural-language mathematics, used to benchmark language models, only cover elementary mathematics. We address this issue by introducing a new dataset: GHOSTS. It is the first natural-language dataset made and curated by working researchers in mathematics that (1) aims to cover graduate-level mathematics and (2) provides a holistic overview of the mathematical capabilities of language models. We benchmark ChatGPT on GHOSTS and evaluate performance against fine-grained criteria. We make this new dataset publicly available to assist a community-driven comparison of ChatGPT with (future) large language models in terms of advanced mathematical comprehension. We conclude that contrary to many positive reports in the media (a potential case of selection bias), ChatGPT's mathematical abilities are significantly below those of an average mathematics graduate student. Our results show that ChatGPT often understands the question but fails to provide correct solutions. Hence, if your goal is to use it to pass a university exam, you would be better off copying from your average peer!