Vision-Language Models (VLMs) have demonstrated remarkable capabilities in understanding visual content, yet systematic biases in their spatial processing remain largely unexplored. This work identifies and characterizes a spatial attention bias in which VLMs consistently describe left-positioned content before right-positioned content in horizontally concatenated images. Through controlled experiments on image pairs using both open-source and closed-source models, we demonstrate that this bias persists across architectures: under neutral prompting conditions, models describe left-positioned content first in approximately 97% of cases. Testing an Arabic-finetuned model shows that the bias persists despite right-to-left language training, ruling out language reading direction as the primary cause. An inspection of the annotation guidelines for the PixMo and Visual Genome training datasets reveals no explicit left-first ordering instructions, suggesting the bias arises from architectural factors rather than explicit instructions in the training data. These findings reveal fundamental limitations in how current VLMs process spatial information.
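The probing setup summarized above can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the paper's actual harness: the image paths, the neutral prompt wording, the keyword-based ordering check, and the `query_vlm` helper are hypothetical placeholders standing in for whichever open- or closed-source VLM API is under test.

```python
# Minimal sketch of the left/right ordering probe: concatenate two images
# horizontally, caption the result with a neutral prompt, and record which
# side's content is mentioned first. query_vlm() and the keywords below are
# hypothetical placeholders, not part of the paper.
from PIL import Image

def concat_horizontal(left_path: str, right_path: str) -> Image.Image:
    """Paste two images side by side on a shared white canvas."""
    left, right = Image.open(left_path), Image.open(right_path)
    height = max(left.height, right.height)
    canvas = Image.new("RGB", (left.width + right.width, height), "white")
    canvas.paste(left, (0, 0))
    canvas.paste(right, (left.width, 0))
    return canvas

def first_mentioned_side(description: str, left_kw: str, right_kw: str) -> str:
    """Return which side's content keyword appears first in the caption."""
    text = description.lower()
    li, ri = text.find(left_kw.lower()), text.find(right_kw.lower())
    if li == -1 and ri == -1:
        return "neither"
    if ri == -1 or (li != -1 and li < ri):
        return "left"
    return "right"

# Usage (query_vlm is a stand-in for any VLM inference call):
# img = concat_horizontal("dog.jpg", "cat.jpg")
# caption = query_vlm(img, prompt="Describe this image.")  # neutral prompt
# print(first_mentioned_side(caption, left_kw="dog", right_kw="cat"))
```

Under this kind of probe, the reported ~97% left-first rate would correspond to the fraction of trials returning "left" across many randomized image pairs, with pair order swapped between trials to control for content effects.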