Sycophancy, the excessive tendency of AI models to agree with user input at the expense of factual accuracy or in contradiction of visual evidence, poses a critical yet underexplored challenge for multimodal large language models (MLLMs). While prior studies have examined this behavior in text-only settings of large language models, existing research on its visual or multimodal counterparts remains limited in scope and depth of analysis. To address this gap, we introduce a comprehensive evaluation benchmark, \textit{PENDULUM}, comprising approximately 2,000 human-curated Visual Question Answering pairs specifically designed to elicit sycophantic responses. The benchmark spans six distinct image domains of varying complexity, enabling a systematic investigation of how image type and inherent challenges influence sycophantic tendencies. Through extensive evaluation of state-of-the-art MLLMs, we observe substantial variability in model robustness and a pronounced susceptibility to sycophantic and hallucinatory behavior. Furthermore, we propose novel metrics to quantify sycophancy in visual reasoning, offering deeper insights into its manifestations across different multimodal contexts. Our findings highlight the urgent need for sycophancy-resilient architectures and training strategies to enhance factual consistency and reliability in future MLLMs. Our proposed dataset, along with the MLLM responses, is available at https://github.com/ashikiut/pendulum/.