Large-scale public deliberations generate thousands of free-form contributions that must be synthesized into representative, neutral summaries for policy use. Although LLMs have shown promise for summarizing large-scale deliberations, they risk underrepresenting minority perspectives and exhibiting bias with respect to input order, which raises fairness concerns in high-stakes contexts. Studying and fixing these issues requires comprehensive evaluation at scale, yet current practice often relies on LLMs as judges, which align only weakly with human judgments. To address this, we present DeliberationBank, a large-scale human-grounded dataset with (1) opinion data spanning ten deliberation questions, contributed by 3,000 participants, and (2) summary judgment data annotated by 4,500 participants across four dimensions (representativeness, informativeness, neutrality, and policy approval). Using these data, we train DeliberationJudge, a fine-tuned DeBERTa model that rates deliberation summaries from individual perspectives. DeliberationJudge is more efficient and better aligned with human judgments than a wide range of LLM judges. With DeliberationJudge, we evaluate 18 LLMs and reveal persistent weaknesses in deliberation summarization, especially the underrepresentation of minority positions. Our framework provides a scalable and reliable way to evaluate deliberation summarization, helping ensure that AI systems used in policymaking are more representative and equitable.