We present a method and dataset for fine-tuning language models with preference supervision using feedback-driven improvement chains. Given a model response, an annotator provides fine-grained feedback by marking ``liked'' and ``disliked'' spans and explaining what they liked or disliked about each. The base model then rewrites the disliked spans accordingly, proceeding from left to right and forming a sequence of incremental improvements. We construct preference pairs for direct alignment from each pair of adjacent steps in the chain, enabling the model to learn from localized, targeted edits. Our approach outperforms direct alignment methods based on standard A/B preference ranking or full contrastive rewrites, demonstrating that structured, revision-based supervision leads to more efficient and effective preference tuning.
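As an illustration, the sketch below shows one way the adjacent-step pairing described above could be implemented; it is a minimal, hypothetical example (the function and variable names are our own, not the released code) in which each later step in a revision chain is treated as the preferred response relative to the step it rewrites.
\begin{verbatim}
# Minimal sketch (illustrative, not the authors' implementation) of turning a
# feedback-driven improvement chain into (chosen, rejected) preference pairs
# for a direct-alignment objective.

from typing import List, Tuple

def make_pairs(chain: List[str]) -> List[Tuple[str, str]]:
    """Given a revision chain [y0, y1, ..., yK], where y_{k+1} rewrites a
    disliked span of y_k, return pairs in which y_{k+1} is preferred over
    y_k."""
    return [(chain[k + 1], chain[k]) for k in range(len(chain) - 1)]

# Example: a three-step chain yields two localized preference pairs.
chain = [
    "initial draft",
    "draft with first disliked span rewritten",
    "draft with both disliked spans rewritten",
]
pairs = make_pairs(chain)  # [(step1, step0), (step2, step1)]
\end{verbatim}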