Reinforcement Learning from Human Feedback (RLHF) is widely used to align large language models, yet practitioners face a persistent puzzle: improving safety often reduces fairness, scaling to diverse populations becomes computationally intractable, and making systems robust tends to amplify majority biases. We formalize this tension as the Alignment Trilemma: no RLHF system can simultaneously achieve (i) ε-representativeness across diverse human values, (ii) polynomial tractability in sample and compute complexity, and (iii) δ-robustness against adversarial perturbations and distribution shift. Through a complexity-theoretic analysis integrating statistical learning theory and robust optimization, we prove that achieving both representativeness (ε ≤ 0.01) and robustness (δ ≤ 0.001) for global-scale populations requires Ω(2^{d_context}) operations, which is exponential in the context dimensionality. We show that current RLHF implementations resolve this trilemma by sacrificing representativeness: they collect only 10^3–10^4 samples from homogeneous annotator pools, whereas 10^7–10^8 samples are needed for true global representation. Our framework provides a unified explanation for documented RLHF pathologies, including preference collapse, sycophancy, and systematic bias amplification. We conclude with concrete directions for navigating these fundamental trade-offs through strategic relaxations of alignment requirements.
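As a compact illustration, the trilemma claim stated above can be written in the following hedged LaTeX sketch. The quantities ε (representativeness gap), δ (robustness radius), and d_context are taken directly from the abstract; the cost functional cost(·) and the claim-environment framing are illustrative placeholders introduced here, not the paper's formal definitions.

```latex
% Hedged sketch: a compact formal reading of the abstract's trilemma claim.
% \epsilon, \delta, and d_{context} are the abstract's quantities; the cost
% functional "cost" is an illustrative placeholder, not the paper's formal
% definition.
\documentclass{article}
\usepackage{amsmath,amsthm}
\newtheorem{claim}{Claim}
\begin{document}
\begin{claim}[Alignment Trilemma, informal]
Let $\mathcal{A}$ be an RLHF procedure over contexts of dimension
$d_{\mathrm{context}}$. If $\mathcal{A}$ is $\epsilon$-representative with
$\epsilon \le 0.01$ and $\delta$-robust with $\delta \le 0.001$ for a
global-scale population, then
\[
  \operatorname{cost}(\mathcal{A}) \;=\; \Omega\!\left(2^{d_{\mathrm{context}}}\right),
\]
so $\mathcal{A}$ cannot also satisfy polynomial sample-and-compute tractability.
\end{claim}
\end{document}
```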