Optimizing discrete diffusion models (DDMs) with rewards remains a challenge: the non-autoregressive paradigm makes importance sampling intractable and rollout complex, confounding reinforcement learning methods such as Group Relative Policy Optimization (GRPO). In this study, we introduce MaskGRPO, the first viable approach to scalable multimodal reinforcement learning in discrete diffusion, with effective importance sampling and modality-specific adaptations. To this end, we first clarify the theoretical foundation of DDMs, which facilitates building an importance estimator that captures valuable token fluctuations for gradient updates. We then tailor the rollout method for visual sequences, yielding diverse completions and reliable optimization gradients. On math reasoning, coding, and visual generation benchmarks, MaskGRPO delivers more stable and efficient updates, leading to stronger reasoning performance and better generation quality. This study establishes MaskGRPO as a systematic policy optimization approach and the first practical way to optimize discretized visual diffusion.
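For context, the sketch below illustrates the generic group-relative, clipped surrogate that GRPO-style methods optimize, with per-token importance ratios restricted to the positions actually predicted during the rollout. It is a minimal illustration under assumed tensor names and shapes, not the paper's importance estimator or masking scheme.

```python
# Hypothetical sketch of a GRPO-style objective with per-token importance
# ratios restricted to predicted (unmasked-during-rollout) positions.
# All names and shapes are illustrative assumptions, not the paper's code.
import torch

def grpo_loss(logp_new, logp_old, mask, rewards, clip_eps=0.2):
    """Clipped group-relative surrogate for a group of G completions.

    logp_new, logp_old: (G, L) per-token log-probabilities under the current
        and rollout policies.
    mask: (G, L) binary mask over the tokens that were actually predicted.
    rewards: (G,) scalar rewards, one per completion in the group.
    """
    # Group-relative advantage: standardize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)      # (G,)
    adv = adv.unsqueeze(-1)                                        # (G, 1)

    # Per-token importance ratio between current and rollout policies.
    ratio = torch.exp(logp_new - logp_old)                         # (G, L)

    # PPO-style clipping, applied only on predicted positions.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = torch.minimum(unclipped, clipped) * mask

    # Average over predicted tokens per completion, then over the group.
    return -(per_token.sum(-1) / mask.sum(-1).clamp(min=1)).mean()
```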