Recent reasoning Large Language Models (LLMs) demonstrate remarkable problem-solving abilities but often generate long thinking traces whose utility is unclear. Our work aims to improve their efficiency, enabling them to reach high performance without overthinking. First, we analyze the entropy of token probabilities in reasoning traces. Across three models, we observe a consistent U-shaped entropy pattern: high entropy on easy problems despite high accuracy, low entropy on medium-difficulty problems, and high entropy on hard problems, reflecting uncertainty. Specifically, we observe a 22--25\% entropy reduction from the easy to the medium-difficulty region, suggesting an \emph{overthinking} phenomenon on easy instances. Building on these insights, we introduce \textbf{DiffAdapt}, a lightweight framework that selects an Easy/Normal/Hard inference strategy for each question based on its difficulty and reasoning-trace entropy. Each inference strategy consists of a fixed prompt, temperature, and maximum token length. In contrast to existing efficiency-optimization methods, our approach does not fine-tune the base LLM; instead, it trains a small probe that classifies the LLM's final hidden state, allowing inexpensive adaptation. We comprehensively evaluate our method on five models and eight benchmarks. It achieves comparable or improved accuracy while reducing token usage by up to 22.4\%, establishing a practical path toward compute-efficient reasoning.
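To make the entropy analysis concrete, the sketch below computes the mean per-token Shannon entropy $H_t = -\sum_{v} p_t(v)\log p_t(v)$ of a reasoning trace under a causal LM. This is a minimal illustration, not the paper's exact pipeline: the choice of model, the use of the Hugging Face `transformers` API, and scoring a pre-generated trace are all assumptions.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model choice; any causal LM exposing next-token logits works.
MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def mean_trace_entropy(trace: str) -> float:
    """Mean entropy (in nats) of the model's next-token distributions
    over the positions of a reasoning trace."""
    ids = tokenizer(trace, return_tensors="pt").input_ids
    logits = model(ids).logits                    # (1, seq_len, vocab)
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # H_t = -sum_v p_t(v) log p_t(v), then average over trace positions.
    entropy = -(probs * log_probs).sum(dim=-1)    # (1, seq_len)
    return entropy.mean().item()
```

Comparing this statistic across problems binned by difficulty is what would surface the U-shaped pattern described above.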
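The probe-based adaptation can likewise be sketched in a few lines: a small classifier over the frozen LLM's final hidden state maps each question to one of three inference strategies. The single-linear-layer architecture and the concrete prompt/temperature/budget values below are illustrative assumptions; the abstract specifies only that each strategy fixes a prompt, a temperature, and a maximum token length, and that only the probe is trained.

```python
import torch
import torch.nn as nn

# Illustrative strategy table; these settings are assumptions, not values
# reported in the paper.
STRATEGIES = {
    0: {"name": "Easy",   "temperature": 0.2, "max_new_tokens": 512},
    1: {"name": "Normal", "temperature": 0.6, "max_new_tokens": 2048},
    2: {"name": "Hard",   "temperature": 1.0, "max_new_tokens": 8192},
}

class DifficultyProbe(nn.Module):
    """Linear probe over the frozen base LLM's final hidden state.
    Only these few thousand parameters are trained; the LLM is untouched."""
    def __init__(self, hidden_size: int, num_classes: int = 3):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, final_hidden_state: torch.Tensor) -> torch.Tensor:
        # final_hidden_state: (batch, hidden_size), e.g. the last-layer
        # state at the question's final token.
        return self.classifier(final_hidden_state)

probe = DifficultyProbe(hidden_size=2048)
h = torch.randn(1, 2048)                  # stand-in for a real hidden state
strategy = STRATEGIES[probe(h).argmax(dim=-1).item()]
print(strategy["name"])                   # chosen inference strategy
```

Training such a probe amounts to a cheap classification task on cached hidden states, which is what makes the adaptation inexpensive relative to fine-tuning the base model.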