This paper presents a pioneering exploration of reinforcement learning (RL) via group relative policy optimization for unified multimodal large language models (ULMs), aimed at simultaneously reinforcing generation and understanding capabilities. Through systematic pilot studies, we uncover the significant potential of ULMs to enable the synergistic co-evolution of dual capabilities within a shared policy optimization framework. Building on this insight, we introduce CoRL, a co-reinforcement learning framework comprising a unified RL stage for joint optimization and a refined RL stage for task-specific enhancement. With the proposed CoRL, our resulting model, ULM-R1, achieves average improvements of 7% on three text-to-image generation datasets and 23% on nine multimodal understanding benchmarks. These results demonstrate the effectiveness of CoRL and highlight the substantial benefit of reinforcement learning in facilitating cross-task synergy and optimization for ULMs. Code is available at https://github.com/mm-vl/ULM-R1.