Mathematical reasoning serves as a crucial testbed for the intelligence of large language models (LLMs), and math word problems (MWPs) are a popular class of such problems. Most MWP datasets consist of problems containing only the information necessary to solve them, while problems with distracting or superfluous conditions are often overlooked. Prior work has tested popular LLMs and found a dramatic performance drop in the presence of distracting conditions. However, datasets of MWPs with distracting conditions are scarce, and most suffer from low difficulty and out-of-context phrasing. This makes the distracting conditions easy to identify and exclude, reducing the credibility of benchmarking on such data. Moreover, adding distracting conditions may change the reasoning and the answer, requiring intensive manual labor to verify and rewrite the solutions. To address these issues, we design an iterative framework that generates distracting conditions with LLMs. We develop a set of prompts that revise MWPs from different perspectives and cognitive levels, encouraging the model to generate distracting conditions along with suggestions for further revision. A further advantage is that the original and revised problems share solutions: we explicitly guide the LLMs to generate distracting conditions that do not alter the original solution, avoiding the need to write new solutions. The framework is efficient and easy to deploy, reducing the cost of constructing MWPs with distracting conditions while maintaining data quality.
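To make the iterative revision loop concrete, the following is a minimal sketch in Python. The prompt wording, the `llm` and `solve` callables, and the answer-equality check are illustrative assumptions standing in for the paper's actual prompts and validation procedure, not its exact implementation.

```python
from typing import Callable

# Illustrative prompt; the paper's prompt set covers multiple
# perspectives and cognitive levels (assumed wording).
REVISE_PROMPT = (
    "Rewrite the following math word problem by adding one distracting "
    "condition that does NOT change the original solution or answer. "
    "Also suggest how the problem could be revised further.\n"
    "Problem: {problem}\n"
    "Previous suggestions: {suggestions}"
)

def add_distractors(
    problem: str,
    answer: str,
    llm: Callable[[str], tuple[str, str]],  # returns (revised_problem, suggestions)
    solve: Callable[[str], str],            # e.g., an LLM solver used for validation
    rounds: int = 3,
) -> str:
    """Iteratively insert distracting conditions while preserving the answer."""
    suggestions = ""
    for _ in range(rounds):
        revised, suggestions = llm(
            REVISE_PROMPT.format(problem=problem, suggestions=suggestions)
        )
        # Keep a revision only if the gold answer is unchanged, so the
        # original solution can be reused verbatim for the revised problem.
        if solve(revised).strip() == answer.strip():
            problem = revised
    return problem
```

The key design choice this sketch reflects is the shared-solution constraint: by accepting only revisions that leave the answer intact, the original solution remains valid for every revised problem, which is what removes the need to author new solutions by hand.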