This paper addresses the challenge of aligning large language models (LLMs) with diverse human preferences in federated learning (FL) environments, where standard methods often fail to represent the full range of viewpoints adequately. We introduce a comprehensive evaluation framework that systematically assesses the trade-off between alignment quality and fairness under different aggregation strategies for human preferences. In our federated setting, each group locally evaluates rollouts and produces reward signals, and the server aggregates these group-level rewards without accessing any raw data. Specifically, we evaluate standard reward aggregation techniques (min, max, and average) and introduce a novel adaptive scheme that dynamically adjusts preference weights based on each group's historical alignment performance. Our experiments on question-answering (Q/A) tasks using a PPO-based RLHF pipeline demonstrate that the adaptive approach consistently achieves superior fairness while maintaining competitive alignment scores. This work offers a robust methodology for evaluating LLM behavior across diverse populations and provides a practical path toward truly pluralistic and fairly aligned models.
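To make the server-side aggregation step concrete, the minimal sketch below implements the min, max, and average rules over group-level rewards, together with one hypothetical instantiation of the adaptive weighting in which groups with weaker historical alignment receive larger weights. The function name `aggregate_rewards`, the softmax-style weighting, and the `temperature` parameter are illustrative assumptions and not the paper's exact update rule.

```python
import numpy as np

def aggregate_rewards(group_rewards, strategy="average", history=None, temperature=1.0):
    """Aggregate per-group reward signals for a single rollout.

    group_rewards : scalar rewards, one per preference group.
    strategy      : "min" | "max" | "average" | "adaptive".
    history       : running mean of each group's past alignment rewards;
                    required only by the adaptive scheme (assumed form).
    """
    r = np.asarray(group_rewards, dtype=float)
    if strategy == "min":
        return float(r.min())
    if strategy == "max":
        return float(r.max())
    if strategy == "average":
        return float(r.mean())
    if strategy == "adaptive":
        # Hypothetical rule: groups whose historical alignment is lower get
        # larger weights, steering the policy update toward fairness.
        h = np.asarray(history, dtype=float)
        w = np.exp(-h / temperature)
        w /= w.sum()
        return float(np.dot(w, r))
    raise ValueError(f"unknown strategy: {strategy}")


# Example: three groups rate the same rollout; the third group has been
# poorly aligned so far, so the adaptive weighting emphasizes its reward.
rewards = [0.9, 0.7, 0.2]
history = [0.8, 0.6, 0.3]
for s in ("min", "max", "average", "adaptive"):
    print(s, aggregate_rewards(rewards, s, history=history))
```

The aggregated scalar would then stand in for the usual reward-model score in the PPO objective, so the federated change is confined to this aggregation step rather than to the RLHF pipeline itself.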