Step-Tagging: Toward controlling the generation of Language Reasoning Models through step monitoring

The field of Language Reasoning Models (LRMs) has been very active over the past few years, with advances in training and inference techniques enabling LRMs to reason longer and more accurately. However, a growing body of studies shows that LRMs remain inefficient: they over-generate verification and reflection steps. To address this challenge, we introduce the Step-Tagging framework, a lightweight sentence classifier that annotates, in real time, the type of reasoning step an LRM is generating. To monitor reasoning behaviors, we introduce ReasonType, a novel taxonomy of reasoning steps. Building on this framework, we demonstrate that online monitoring of the count of specific step types yields effective and interpretable early-stopping criteria for LRM inference. We evaluate the Step-Tagging framework on three open-source reasoning models across the standard mathematical benchmarks MATH500, GSM8K, and AIME, as well as non-mathematical tasks (GPQA and MMLU-Pro). We achieve a 20% to 50% token reduction while maintaining accuracy comparable to standard generation, with the largest gains observed on more computation-heavy tasks. This work offers a novel way to increase control over the generation of LRMs and a new tool to study their behaviors.
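To make the count-based early-stopping idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the framework's implementation: the keyword-based `toy_tagger` is a hypothetical stand-in for the trained sentence classifier, the tag names (`verification`, `conclusion`, `other`) are placeholders rather than the actual ReasonType categories, and `monitor_steps`, `watched_tag`, and `max_count` are names invented for this example.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def toy_tagger(sentence: str) -> str:
    """Hypothetical stand-in for a trained step classifier (keyword rules only)."""
    s = sentence.lower()
    if any(k in s for k in ("wait", "verify", "double-check", "let me check")):
        return "verification"  # placeholder tag, not an official ReasonType label
    if any(k in s for k in ("so the answer", "therefore")):
        return "conclusion"    # placeholder tag
    return "other"


def monitor_steps(
    steps: Iterable[str],
    tagger: Callable[[str], str],
    watched_tag: str = "verification",
    max_count: int = 2,
) -> Tuple[List[str], bool]:
    """Consume reasoning steps one by one and stop once `watched_tag`
    has been observed `max_count` times.

    Returns the steps kept so far and a flag telling whether the
    early-stopping criterion fired.
    """
    counts: Counter = Counter()
    kept: List[str] = []
    for step in steps:
        tag = tagger(step)
        counts[tag] += 1
        kept.append(step)
        if tag == watched_tag and counts[tag] >= max_count:
            return kept, True  # criterion met: cut the generation here
    return kept, False  # stream ended before the criterion was met


# Assumed example trace; in practice the steps would arrive from a streaming LRM.
EXAMPLE_STEPS = [
    "Compute 12 * 7 = 84.",
    "Therefore the answer is 84.",
    "Wait, let me verify that.",
    "12 * 7 is 84, confirmed.",
    "Let me double-check once more.",
    "Still 84.",
]

if __name__ == "__main__":
    kept, stopped = monitor_steps(EXAMPLE_STEPS, toy_tagger, max_count=2)
    print(stopped, len(kept))  # True 5
```

In this sketch the stopping rule is interpretable by construction: generation is cut after the second step tagged as verification, so the final "Still 84." step is never produced. Which step types to watch and what threshold to use are tuning choices that the sketch leaves open.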
