We study the performance of a commercially available large language model (LLM) known as ChatGPT on math word problems (MWPs) from the dataset DRAW-1K. To our knowledge, this is the first independent evaluation of ChatGPT. We found that ChatGPT's performance changes dramatically based on the requirement to show its work, failing 20% of the time when it provides work compared with 84% when it does not. Further, we identify several factors of MWPs, relating to the number of unknowns and the number of operations, that lead to a higher probability of failure than the prior; in particular, across all experiments, the probability of failure increases linearly with the number of addition and subtraction operations. We also present baseline machine learning models to predict whether ChatGPT can correctly answer an MWP, and we have released the dataset of ChatGPT's responses to support further work on the characterization of LLM performance.