In this work, we investigate how explicitly modeling problem-difficulty priors shapes the effectiveness of reinforcement-learning-based fine-tuning for multimodal reasoning. Our exploration comprises three perspectives. First, through offline data curation, we analyze the U-shaped difficulty distribution of two given datasets by sampling the base model over multiple rounds, then filter out prompts that are either too simple or too difficult to provide meaningful gradients, and use the remaining data for the subsequent two-stage training. Second, we implement online advantage differentiation, computing group-wise empirical accuracy as a difficulty proxy and adaptively reweighting advantage estimates to provide stronger learning signals for more challenging problems. Finally, we introduce difficulty hints as explicit prompts for more complex samples in the second training stage, encouraging the model to calibrate its reasoning depth and perform reflective validation checks. Our approach achieves significant performance gains across various multimodal mathematical reasoning benchmarks while using only 2K + 0.6K training samples across the two stages.
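To make the two difficulty-aware steps concrete, the sketch below illustrates one plausible realization in Python: an offline filter that keeps only prompts whose empirical pass rate (from multi-round sampling with the base model) lies in a mid-difficulty band, and a group-wise advantage reweighting in which lower group accuracy yields a larger weight. The thresholds, the linear weighting form `w = 1 + alpha * (1 - accuracy)`, and the function names are illustrative assumptions, not the exact formulas used in this work.

```python
import numpy as np

def offline_difficulty_filter(pass_counts, num_rollouts, low=0.1, high=0.9):
    """Keep prompts whose empirical pass rate lies strictly between `low` and `high`.

    `pass_counts[i]` is how many of `num_rollouts` base-model samples for prompt i
    were judged correct. Prompts that are almost always solved (too easy) or almost
    never solved (too hard) yield near-zero group advantages and weak gradients, so
    they are dropped. The 0.1 / 0.9 thresholds are placeholders.
    """
    rates = np.asarray(pass_counts, dtype=float) / num_rollouts
    return np.where((rates > low) & (rates < high))[0]

def difficulty_weighted_advantages(rewards, alpha=0.5):
    """Group-normalised advantages, reweighted by a difficulty proxy.

    `rewards` holds the binary correctness rewards of one prompt's rollout group.
    For 0/1 rewards, the group mean serves both as the baseline and as the
    empirical accuracy (the difficulty proxy): lower accuracy -> harder prompt
    -> larger weight. The linear form below is an assumed placeholder.
    """
    rewards = np.asarray(rewards, dtype=float)
    accuracy = rewards.mean()                          # group-wise empirical accuracy
    advantages = (rewards - accuracy) / (rewards.std() + 1e-6)
    weight = 1.0 + alpha * (1.0 - accuracy)            # stronger signal for harder prompts
    return weight * advantages
```

For example, a rollout group with rewards `[1, 0, 0, 0, 0, 0, 0, 1]` (accuracy 0.25) would have its normalized advantages scaled up relative to a group with accuracy 0.75, under the assumed linear weighting.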