On Extending Direct Preference Optimization to Accommodate Ties

We derive and investigate two DPO variants that explicitly model the possibility of declaring a tie in pairwise comparisons. We replace the Bradley-Terry model in DPO with two well-known extensions, one by Rao and Kupper and one by Davidson, that assign probability to ties as alternatives to clear preferences. Our experiments in neural machine translation and summarization show that explicitly labeled ties can be added to the datasets for these DPO variants without the degradation in task performance that is observed when the same tied pairs are presented to DPO. We find empirically that including ties leads to stronger regularization with respect to the reference policy, as measured by KL divergence, and we see this even for DPO in its original form. We provide a theoretical explanation for this regularization effect using ideal DPO policy theory. We further show performance improvements over DPO in translation and mathematical reasoning using our DPO variants. We find it can be beneficial to include ties in preference optimization rather than simply discarding them, as is common practice.
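
To make the difference between the three preference models concrete, the sketch below computes win, tie, and loss probabilities as a function of a reward margin d. It uses the classical textbook parameterizations of the Rao-Kupper and Davidson models, which may differ from the exact form derived in the paper (for example in how the tie parameter is scaled). The function names, the parameter `nu`, and its default values are assumptions chosen for illustration. In a DPO setting, d would typically be the scaled difference of log-probability ratios between the two responses, which is also an assumption here.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def bradley_terry(d):
    """Return (P(win), P(tie), P(loss)) for reward margin d.

    Bradley-Terry assigns no probability to ties.
    """
    return sigmoid(d), 0.0, sigmoid(-d)


def rao_kupper(d, nu=3.0):
    """Rao-Kupper model in its classical form (assumed parameterization).

    With strengths lam_i = exp(r_i) and nu >= 1:
        P(i beats j) = lam_i / (lam_i + nu * lam_j) = sigmoid(d - log(nu))
    The tie probability is the remaining mass.
    """
    alpha = math.log(nu)
    win = sigmoid(d - alpha)
    loss = sigmoid(-d - alpha)
    return win, 1.0 - win - loss, loss


def davidson(d, nu=1.0):
    """Davidson model in its classical form (assumed parameterization).

    With strengths lam_i = exp(r_i) and nu >= 0:
        P(tie) = nu * sqrt(lam_i * lam_j) / (lam_i + lam_j + nu * sqrt(lam_i * lam_j))
    Dividing through by sqrt(lam_i * lam_j) gives the symmetric form below.
    """
    a = math.exp(d / 2.0)   # lam_i / sqrt(lam_i * lam_j)
    b = math.exp(-d / 2.0)  # lam_j / sqrt(lam_i * lam_j)
    z = a + b + nu
    return a / z, nu / z, b / z


if __name__ == "__main__":
    for d in (0.0, 1.0, 3.0):
        print(d, bradley_terry(d), rao_kupper(d), davidson(d))
```

In both tie-aware models the tie probability is largest when the margin is zero and shrinks as the margin grows in either direction, while setting `nu=1` in `rao_kupper` or `nu=0` in `davidson` recovers Bradley-Terry.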
