Summarization systems are ultimately evaluated by human annotators and raters. These annotators and raters usually do not reflect the demographics of end users; instead, they are recruited from student populations or crowdsourcing platforms with skewed demographics. For two different evaluation scenarios -- evaluation against gold summaries and system output ratings -- we show that summary evaluation is sensitive to protected attributes. This can severely bias system development and evaluation, leading us to build models that cater to some groups rather than others.