The creation of relevance assessments by human assessors (often nowadays crowdworkers) is a vital step when building IR test collections. Prior work has investigated assessor quality and behaviour, but not the impact of a document's presentation modality on assessor efficiency and effectiveness. Given the rise of voice-based interfaces, we investigate whether it is feasible for assessors to judge the relevance of text documents via a voice-based interface. We ran a user study (n = 49) on a crowdsourcing platform where participants judged the relevance of short and long documents sampled from the TREC Deep Learning corpus, presented to them either in the text or the voice modality. We found that: (i) participants are equally accurate in their judgements across both the text and voice modalities; (ii) with increased document length it takes participants significantly longer (for documents of length > 120 words, almost twice as long) to make relevance judgements in the voice condition; and (iii) the ability of assessors to ignore stimuli that are not relevant (i.e., inhibition) impacts assessment quality in the voice modality: assessors with higher inhibition are significantly more accurate than those with lower inhibition. Our results indicate that we can reliably leverage the voice modality as a means to effectively collect relevance labels from crowdworkers.