Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective

Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Mingqi Wu, Yanwei Fu, Tao Ji, Tao Gui, Xuanjing Huang, Kai Chen

From arXiv, 28 pages (preprint).

Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models (MLLMs) to downstream tasks. While they are effective at task adaptation, their impact on prior knowledge remains unclear. In this paper, we introduce jigsaw puzzles as a novel task absent from existing pretraining corpora and systematically study the behavior of SFT and RFT on the open-source Qwen2.5-VL series of multimodal models. Our experiments reveal a sharp trade-off: SFT enables rapid task acquisition but leads to catastrophic forgetting, whereas RFT learns more slowly but maintains prior knowledge. We study this phenomenon through the lens of learning dynamics, examining both the magnitude and the direction of the influence that training data exert on prior knowledge. Our analysis shows that RFT mainly reinforces correct samples that are naturally aligned with the base model's probability landscape, leading to weaker interference with prior knowledge. Moreover, training on RFT-simulated rollouts, which exert a small magnitude of influence and are well aligned in direction with prior knowledge, allows SFT to preserve prior knowledge better while still learning new tasks rapidly. These findings suggest that the distribution of training data, rather than algorithmic differences, plays a central role in forgetting, and they highlight RFT's potential for stable continual learning in multimodal large language models.
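The abstract does not spell out how the "magnitude and direction" of influence are measured. As a rough illustration only, and not the authors' formulation, the sketch below uses a common first-order view: after one gradient step on a training target, the change in the log-probability of a prior-knowledge target is approximately the learning rate times the inner product of the two gradients. The toy three-token model, the function names (`grad_log_prob`, `grad_norm`, `influence`), and all numeric values are assumptions made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, target):
    """Gradient of log softmax(logits)[target] with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if i == target else 0.0) - p for i, p in enumerate(probs)]


def grad_norm(logits, target):
    """Magnitude of the update that a single training target would induce."""
    return math.sqrt(sum(g * g for g in grad_log_prob(logits, target)))


def influence(logits, train_target, prior_target, lr=0.1):
    """First-order change in log p(prior_target) after one ascent step
    on log p(train_target). The sign gives the direction of the influence."""
    g_train = grad_log_prob(logits, train_target)
    g_prior = grad_log_prob(logits, prior_target)
    return lr * sum(a * b for a, b in zip(g_train, g_prior))


# Hypothetical base model: token 0 is likely, token 2 is very unlikely.
base_logits = [2.0, 1.0, -3.0]
prior_target = 0  # stands in for a piece of prior knowledge

for train_target in range(3):
    print(
        train_target,
        round(softmax(base_logits)[train_target], 4),
        round(grad_norm(base_logits, train_target), 4),
        round(influence(base_logits, train_target, prior_target), 4),
    )
```

In this toy setting, targets that the base model already assigns high probability produce smaller gradients, which is one intuition for why samples aligned with the base model's probability landscape would disturb existing knowledge less. A real analysis would compute such quantities over model parameters and actual data rather than a single logit vector.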
