When asked, current large language models (LLMs) like ChatGPT claim that they can assist us with relevance judgments. Many researchers think this would not lead to credible IR research. In this perspective paper, we discuss possible ways for LLMs to assist human experts along with concerns and issues that arise. We devise a human-machine collaboration spectrum that allows categorizing different relevance judgment strategies, based on how much the human relies on the machine. For the extreme point of "fully automated assessment", we further include a pilot experiment on whether LLM-based relevance judgments correlate with judgments from trained human assessors. We conclude the paper by providing two opposing perspectives - for and against the use of LLMs for automatic relevance judgments - and a compromise perspective, informed by our analyses of the literature, our preliminary experimental evidence, and our experience as IR researchers. We hope to start a constructive discussion within the community to avoid a stalemate during review, where work is damned if it uses LLMs for evaluation and damned if it doesn't.