Mathematical Capabilities of ChatGPT

We investigate the mathematical capabilities of ChatGPT by testing it on publicly available datasets, as well as hand-crafted ones, and measuring its performance against other models trained on a mathematical corpus, such as Minerva. We also test whether ChatGPT can be a useful assistant to professional mathematicians by emulating various use cases that come up in the daily professional activities of mathematicians (question answering, theorem searching). In contrast to formal mathematics, where large databases of formal proofs are available (e.g., the Lean Mathematical Library), current datasets of natural-language mathematics, used to benchmark language models, only cover elementary mathematics. We address this issue by introducing a new dataset: GHOSTS. It is the first natural-language dataset made and curated by working researchers in mathematics that (1) aims to cover graduate-level mathematics and (2) provides a holistic overview of the mathematical capabilities of language models. We benchmark ChatGPT on GHOSTS and evaluate performance against fine-grained criteria. We make this new dataset publicly available to assist a community-driven comparison of ChatGPT with (future) large language models in terms of advanced mathematical comprehension. We conclude that contrary to many positive reports in the media (a potential case of selection bias), ChatGPT's mathematical abilities are significantly below those of an average mathematics graduate student. Our results show that ChatGPT often understands the question but fails to provide correct solutions. Hence, if your goal is to use it to pass a university exam, you would be better off copying from your average peer!

Related content

ChatGPT

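ChatGPT (full name: Chat Generative Pre-trained Transformer) is a chatbot developed by the US company OpenAI [1] and released on November 30, 2022. It is an AI-driven natural language processing tool that holds conversations by learning and understanding human language, responds in light of the conversational context so that the exchange reads much like a human one, and can also handle tasks such as writing emails, video scripts, marketing copy, translations, code, and papers. [1] https://openai.com/blog/chatgpt/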
