Direct Preference Optimization (DPO) has proven effective in aligning large language models with human preferences but is often constrained to pairwise comparisons -- overlooking additional positive and negative responses that are commonly available in real-world settings. We propose Simultaneous Weighted Preference Optimization (SWEPO), which incorporates multiple responses per query and prioritizes those that deviate most from the average reward. This deviation-based weighting focuses training on the most informative outliers, akin to a built-in curriculum. Theoretically, we prove that such multi-preference sampling lowers alignment bias, bounding the expected deviation from the true acceptable-response distribution at a rate of $\mathcal{O}(\tfrac{1}{\sqrt{k}})$. Empirically, SWEPO outperforms state-of-the-art baselines on the UltraFeedback dataset and demonstrates substantial improvements over DPO and InfoNCA, yielding gains of up to $\sim 4\%$ in length-controlled win rate on AlpacaEval.
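To make the deviation-based weighting idea concrete, the sketch below shows one plausible way to weight multiple sampled responses by how far their rewards fall from the per-query mean. It is an illustrative assumption, not the paper's implementation: the function names, the softmax normalization, and the sign-based treatment of above- versus below-average responses are all choices made here for clarity.

```python
import numpy as np

def deviation_weights(rewards, temperature=1.0):
    """Weight each response by its absolute deviation from the per-query
    mean reward, so clear outliers (very good or very bad responses)
    receive the largest weights. Softmax normalization is an assumption."""
    rewards = np.asarray(rewards, dtype=float)
    deviations = np.abs(rewards - rewards.mean())
    scaled = deviations / temperature
    scaled -= scaled.max()               # numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()

def weighted_multi_preference_loss(logps, rewards, temperature=1.0):
    """Toy multi-preference objective: responses with above-average reward
    are pushed up, below-average responses are pushed down, each scaled by
    its deviation weight. A sketch only, not SWEPO's exact loss."""
    logps = np.asarray(logps, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    w = deviation_weights(rewards, temperature)
    signs = np.sign(rewards - rewards.mean())   # +1 above mean, -1 below
    return float(-(w * signs * logps).sum())

# Example: four sampled responses for a single query.
logps = [-12.3, -15.1, -9.8, -20.4]    # policy log-probabilities
rewards = [0.9, 0.2, 0.7, -0.5]        # scalar rewards from a reward model
print(deviation_weights(rewards))
print(weighted_multi_preference_loss(logps, rewards))
```

Under this sketch, the response with reward $-0.5$ and the one with reward $0.9$ dominate the update, while near-average responses contribute little, which mirrors the curriculum-like emphasis on informative outliers described in the abstract.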