Existing commercial search engines often struggle to represent different perspectives of a search query. Argument retrieval systems address this limitation and provide both positive (PRO) and negative (CON) perspectives on a user's information need for a controversial topic (e.g., climate change). The effectiveness of such argument retrieval systems is typically evaluated based on topical relevance and argument quality, without accounting for the often differing number of documents shown for each argument stance (PRO or CON). Systems may therefore retrieve relevant passages, but with a biased exposure of arguments. In this work, we analyze a range of non-stochastic fairness-aware ranking and diversity metrics to evaluate the extent to which argument stances are fairly exposed in argument retrieval systems. Using the official runs of the argument retrieval task Touché at CLEF 2020, as well as synthetic data that lets us control the amount and order of argument stances in the rankings, we show that the systems with the best effectiveness in terms of topical relevance are not necessarily the most fair or the most diverse in terms of argument stance. The relationships we find between (un)fairness and diversity metrics shed light on how to evaluate group fairness, in addition to topical relevance, in argument retrieval settings.
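To make the notion of "biased exposure of arguments" concrete, the following is a minimal sketch of one way a position-based exposure disparity between stances could be computed. It assumes a DCG-style logarithmic position discount; the function names and the normalization choice are illustrative, not the specific metrics evaluated in this work.

```python
import math

def exposure(ranking, stance):
    # Position-based exposure: the document at rank r receives
    # weight 1 / log2(r + 1), a DCG-style discount.
    return sum(1.0 / math.log2(r + 1)
               for r, s in enumerate(ranking, start=1) if s == stance)

def stance_exposure_disparity(ranking):
    # Absolute difference in total exposure between PRO and CON documents,
    # normalized by the total exposure of the ranking.
    # 0 means the two stances receive equal exposure.
    total = sum(1.0 / math.log2(r + 1) for r in range(1, len(ranking) + 1))
    return abs(exposure(ranking, "PRO") - exposure(ranking, "CON")) / total

# A ranking that alternates stances exposes them more evenly than one
# that front-loads a single stance, even though both contain PRO and CON.
balanced = ["PRO", "CON", "PRO", "CON"]
skewed = ["PRO", "PRO", "PRO", "CON"]
print(stance_exposure_disparity(balanced))
print(stance_exposure_disparity(skewed))
```

Note that even the alternating ranking has a small nonzero disparity, because the top position carries more weight than the second: equal counts of PRO and CON documents do not by themselves guarantee equal exposure.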