Vision-language models (VLMs) have shown remarkable abilities by integrating large language models with visual inputs. However, they often fail to utilize visual evidence adequately, either depending on linguistic priors in vision-centric tasks or resorting to textual shortcuts during reasoning. Although reinforcement learning (RL) can align models with desired behaviors, its application to VLMs has been hindered by the lack of scalable and reliable reward mechanisms. To overcome this challenge, we propose SSL4RL, a novel framework that leverages self-supervised learning (SSL) tasks as a source of verifiable rewards for RL-based fine-tuning. Our approach reformulates SSL objectives, such as predicting image rotation or reconstructing masked patches, into dense, automatic reward signals, eliminating the need for human preference data or unreliable AI evaluators. Experiments show that SSL4RL substantially improves performance on both vision-centric and vision-language reasoning benchmarks. Furthermore, through systematic ablations, we identify key factors (task difficulty, model scale, and semantic alignment with the target domain) that influence the effectiveness of SSL4RL tasks, offering new design principles for future work. We also demonstrate the framework's generality by applying it to graph learning, where it yields significant gains. SSL4RL establishes a versatile and effective paradigm for aligning multimodal models using verifiable, self-supervised objectives.
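To make the core idea concrete, the following is a minimal illustrative sketch (not the authors' implementation) of how a self-supervised rotation-prediction task can be turned into an automatically verifiable reward for RL fine-tuning. The function `vlm_generate` is a hypothetical stand-in for the policy model's generation call; everything else uses standard Python and PIL.

```python
import random
import re
from PIL import Image

ROTATIONS = [0, 90, 180, 270]  # candidate rotation angles in degrees


def make_rotation_example(image: Image.Image):
    """Rotate the image by a random angle; the angle itself is the verifiable label."""
    angle = random.choice(ROTATIONS)
    rotated = image.rotate(angle, expand=True)
    prompt = ("The image has been rotated by 0, 90, 180, or 270 degrees. "
              "Answer with the rotation angle only.")
    return rotated, prompt, angle


def rotation_reward(response: str, angle: int) -> float:
    """Automatic, verifiable reward: 1.0 if the predicted angle matches, else 0.0."""
    match = re.search(r"\b(0|90|180|270)\b", response)
    return 1.0 if match and int(match.group(1)) == angle else 0.0


def collect_rollout(vlm_generate, image: Image.Image):
    """One RL rollout: build the SSL query, sample a response, score it without human labels."""
    rotated, prompt, angle = make_rotation_example(image)
    response = vlm_generate(rotated, prompt)   # hypothetical policy call
    reward = rotation_reward(response, angle)  # no preference data or AI judge needed
    return response, reward
```

The reward here is computed purely from the self-supervised transformation applied to the input, which is what makes it scalable and reliable compared to human preference data or model-based evaluators.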