Long-form generation has become a critical and challenging application for Large Language Models (LLMs). Existing studies are limited by their reliance on scarce, high-quality long-form response data and their focus on coarse-grained, general-purpose metrics (e.g., coherence and helpfulness), overlooking the nuanced, scenario-specific requirements of real-world tasks. To address these limitations, we propose a framework utilizing Adaptive Constraint-Enhanced reward for long-form generation Reinforcement Learning (ACE-RL). ACE-RL first decomposes each instruction into a set of fine-grained, adaptive constraint criteria spanning key dimensions of long-form generation tasks. Subsequently, we design a reward mechanism that quantifies response quality based on its satisfaction of the corresponding constraints, converting subjective quality evaluation into constraint verification. Finally, we leverage reinforcement learning to optimize LLMs with these fine-grained signals. Experimental results show that ACE-RL significantly outperforms existing SFT and RL baselines by 18.63% and 7.61% on WritingBench, and our top-performing model even surpasses proprietary systems such as GPT-4o by 8.76%, providing a more effective training paradigm for long-form generation scenarios.
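To make the reward mechanism concrete, below is a minimal sketch of how constraint-based reward computation could look: an instruction is decomposed into fine-grained constraints, each constraint is checked against the response, and the reward is the fraction satisfied. All names here (`decompose_instruction`, `constraint_reward`, the toy judge and verifier) are hypothetical illustrations, not the authors' implementation; the paper's actual constraint extraction and verification rely on LLM-based judging rather than the string heuristics used in this toy example.

```python
# Sketch of a constraint-verification reward in the spirit of ACE-RL (assumed interfaces).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Constraint:
    """A fine-grained, instruction-specific criterion (e.g., 'cover X', 'use a formal tone')."""
    description: str


def decompose_instruction(instruction: str,
                          judge: Callable[[str], List[str]]) -> List[Constraint]:
    """Ask a judge (assumed to be an LLM in practice) to split one instruction
    into fine-grained constraint criteria."""
    return [Constraint(c) for c in judge(instruction)]


def constraint_reward(response: str,
                      constraints: List[Constraint],
                      verifier: Callable[[str, str], bool]) -> float:
    """Reward = fraction of constraints the response satisfies, i.e. subjective
    quality evaluation recast as per-constraint verification."""
    if not constraints:
        return 0.0
    satisfied = sum(verifier(response, c.description) for c in constraints)
    return satisfied / len(constraints)


if __name__ == "__main__":
    # Toy stand-ins for LLM judges, only to keep the sketch self-contained and runnable.
    toy_judge = lambda instr: ["mentions the deadline", "ends with a call to action"]
    toy_verifier = lambda resp, c: c.split()[-1] in resp.lower()

    constraints = decompose_instruction("Write a project update email.", toy_judge)
    response = "Status update: the deadline is Friday. Please reply with your action items."
    print(constraint_reward(response, constraints, toy_verifier))  # 1.0 in this toy case
```

Such a scalar reward could then be plugged into a standard RL fine-tuning loop (e.g., a policy-gradient objective over sampled long-form responses); the specific RL algorithm and judge prompts are left to the paper's full method description.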