Applying reinforcement learning, typically through GRPO, to large vision-language model (LVLM) reasoning either struggles to effectively scale reasoning length or produces verbose outputs across all tasks with only marginal accuracy gains. To address this issue, we present FAST-GRPO, a variant of GRPO that dynamically adapts reasoning depth to question characteristics. Through empirical analysis, we establish the feasibility of fast-slow thinking in LVLMs by investigating how response length and data distribution affect performance. Motivated by these observations, we introduce two complementary metrics to estimate question difficulty, guiding the model in deciding when fast or slow thinking is more appropriate. We then incorporate adaptive length-based rewards and difficulty-aware KL divergence into the GRPO algorithm. Experiments across seven reasoning benchmarks demonstrate that FAST achieves state-of-the-art accuracy, with over 10\% relative improvement over the base model, while reducing token usage by 32.7--67.3\% compared with previous slow-thinking approaches, effectively balancing reasoning length and accuracy.
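The two ingredients named above, a difficulty estimate driving an adaptive length reward and a difficulty-aware KL coefficient, can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the function names, the averaging of the two difficulty signals, the token budget, and the linear scaling of the KL coefficient are all assumptions.

```python
def question_difficulty(group_accuracy: float, mean_len: float, max_len: float) -> float:
    """Combine two complementary signals (hypothetical forms): low accuracy
    across a sampled group and long sampled responses both suggest a harder
    question. Returns a value in [0, 1]."""
    acc_signal = 1.0 - group_accuracy          # harder if the group rarely answers correctly
    len_signal = min(mean_len / max_len, 1.0)  # harder if sampled responses run long
    return 0.5 * (acc_signal + len_signal)

def length_reward(n_tokens: int, difficulty: float, budget: int = 1024) -> float:
    """Adaptive length shaping: penalize tokens beyond a budget more heavily
    on easy questions (encouraging fast thinking) and relax the penalty on
    hard ones (allowing slow thinking)."""
    overuse = max(n_tokens - budget, 0) / budget
    return -(1.0 - difficulty) * overuse

def kl_coefficient(difficulty: float, base_beta: float = 0.04) -> float:
    """Difficulty-aware KL: loosen the KL constraint on hard questions so the
    policy can explore longer chains of thought, and tighten it on easy ones."""
    return base_beta * (1.0 - difficulty)
```

Under this sketch, an over-budget response on an easy question (low difficulty) receives a larger length penalty and a stronger KL pull toward the reference policy than the same response on a hard question.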