Large Language Models (LLMs) have demonstrated impressive capabilities in solving complex tasks, including those requiring a certain level of reasoning. In this paper, we focus on state tracking, a problem in which models need to keep track of the state governing a number of entities. To isolate the state tracking component from other factors, we propose a benchmark based on three well-defined state tracking tasks and analyse the performance of LLMs in different scenarios. The results indicate that the most recent generation of LLMs (specifically, GPT-4 and Llama3) is capable of tracking state, especially when combined with mechanisms such as Chain of Thought. Models from the previous generation, however, understand the task and can solve it in the initial stages, but often fail after a certain number of steps.