Direct Preference Optimization (DPO) is an efficient alternative to reinforcement learning from human feedback (RLHF), yet it typically assumes hard binary labels and pairwise comparisons. Such assumptions can be brittle under noisy or distribution-shifted supervision. We present Anchored Direct Preference Optimization (ADPO), which (i) incorporates soft preference probabilities, (ii) aligns policy updates through reference anchoring that induces an implicit trust region, and (iii) extends to listwise learning via Plackett-Luce modeling. In controlled synthetic setups covering 12 scenarios (4 noise types x 3 severities) and 3 model scales, ADPO exhibits relative improvements ranging from 12% to 79% over a standard DPO baseline (10-seed means; 95% CIs in the Appendix). Hard labels tend to fare better under severe noise, whereas soft labels yield better calibration under distribution shift; listwise variants achieve the highest WinMass (expected probability mass on the ground-truth best item) in 9/12 scenarios. Larger models amplify ADPO's benefits (0.718 vs. 0.416 at hidden=256), suggesting that anchoring acts as an effective trust-region regularizer. We release code and configurations to facilitate reproducibility.
 翻译:暂无翻译