This paper addresses the challenge of aligning large language models (LLMs) with diverse human preferences in federated learning (FL) environments, where standard methods often fail to adequately represent the full range of viewpoints. We introduce a comprehensive evaluation framework that systematically assesses the trade-off between alignment quality and fairness under different strategies for aggregating human preferences. In our federated setting, each group locally evaluates rollouts and produces reward signals, and the server aggregates these group-level rewards without accessing any raw data. Specifically, we evaluate standard reward aggregation techniques (min, max, and average) and introduce a novel adaptive scheme that dynamically adjusts preference weights based on each group's historical alignment performance. Our experiments on question-answering (Q/A) tasks using a PPO-based RLHF pipeline demonstrate that the adaptive approach consistently achieves better fairness than the static aggregation baselines while maintaining competitive alignment scores. This work offers a robust methodology for evaluating LLM behavior across diverse populations and provides a practical path toward truly pluralistic and fairly aligned models.
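To make the aggregation step concrete, below is a minimal Python sketch of server-side reward aggregation in the spirit of the schemes described above. The function name `aggregate_rewards` and the specific exponential down-weighting rule used for the adaptive scheme are illustrative assumptions, not the paper's implementation.

```python
import math
from typing import Dict, List, Optional


def aggregate_rewards(group_rewards: Dict[str, float],
                      scheme: str = "average",
                      history: Optional[Dict[str, List[float]]] = None) -> float:
    """Combine per-group reward signals for one rollout into a single scalar."""
    values = list(group_rewards.values())
    if scheme == "min":        # most conservative: optimize for the worst-off group
        return min(values)
    if scheme == "max":        # most permissive: follow the best-served group
        return max(values)
    if scheme == "average":    # uniform weighting across groups
        return sum(values) / len(values)
    if scheme == "adaptive":
        # Hypothetical adaptive rule: weight each group by exp(-mean past reward),
        # so groups whose preferences have historically been aligned less well
        # receive larger weights in subsequent updates.
        weights = {}
        for g in group_rewards:
            past = (history or {}).get(g, [])
            mean_past = sum(past) / len(past) if past else 0.0
            weights[g] = math.exp(-mean_past)
        total = sum(weights.values())
        return sum(weights[g] * r for g, r in group_rewards.items()) / total
    raise ValueError(f"unknown aggregation scheme: {scheme}")
```

In this sketch the server sees only the scalar rewards reported by each group, mirroring the setting in which raw preference data never leaves the clients.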