Large audio-language models (LALMs) are often used in tasks that involve reasoning over ordered options. An open question is whether their predictions are influenced by the order of answer choices, which would indicate a form of selection bias and undermine their reliability. In this paper, we identify and analyze this problem in LALMs. Through extensive experiments on six LALMs across three widely used benchmarks and their spoken counterparts, we demonstrate that no model is immune to this bias. Shuffling the order of answer options can cause performance fluctuations of up to 24% and can even change model rankings, raising concerns about the reliability of current evaluation practices. We further study permutation-based debiasing strategies and show that they mitigate the bias in most cases. Our work represents the first systematic investigation of this issue in LALMs, and we hope it raises awareness and motivates further research in this direction.
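For concreteness, the sketch below illustrates one permutation-based strategy of the kind referred to in the abstract: query the model under several shuffled option orders, map each choice back to the original ordering, and majority-vote. This is a minimal illustration under stated assumptions, not the paper's implementation; `predict` is a hypothetical callable standing in for an LALM, and the paper's actual strategies may aggregate differently (e.g., over likelihoods or cyclic permutations only).

```python
import itertools
import random
from collections import Counter
from typing import Callable, Optional, Sequence


def permutation_debiased_answer(
    question: str,
    options: Sequence[str],
    predict: Callable[[str, Sequence[str]], int],  # hypothetical LALM wrapper: returns chosen index
    n_perms: Optional[int] = None,  # cap the number of sampled permutations; None = use all
    seed: int = 0,
) -> str:
    """Majority-vote the model's choice over shuffled answer-option orders."""
    perms = list(itertools.permutations(range(len(options))))
    if n_perms is not None and n_perms < len(perms):
        random.Random(seed).shuffle(perms)
        perms = perms[:n_perms]

    votes: Counter = Counter()
    for perm in perms:
        shuffled = [options[i] for i in perm]
        chosen = predict(question, shuffled)  # index into the shuffled order
        votes[perm[chosen]] += 1              # map back to the original option index
    return options[votes.most_common(1)[0][0]]
```

The same loop doubles as a bias probe: the spread of per-permutation accuracies over a benchmark is exactly the order-induced fluctuation the abstract reports.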