As Large Language Models (LLMs) gain widespread adoption across the multilingual world, ensuring hallucination-free factuality becomes increasingly crucial. However, existing benchmarks for evaluating the reliability of Multimodal Large Language Models (MLLMs) focus predominantly on textual or visual modalities, with a primary emphasis on English, leaving a gap in the evaluation of multilingual inputs, especially speech. To bridge this gap, we propose a novel Cross-lingual and Cross-modal Factuality benchmark (CCFQA). Specifically, the CCFQA benchmark contains parallel speech-text factual questions across 8 languages, designed to systematically evaluate MLLMs' cross-lingual and cross-modal factuality capabilities. Our experimental results demonstrate that current MLLMs still face substantial challenges on the CCFQA benchmark. Furthermore, we propose a few-shot transfer learning strategy that effectively transfers the Question Answering (QA) capabilities of LLMs in English to multilingual Spoken Question Answering (SQA) tasks, achieving performance competitive with GPT-4o-mini-Audio using only 5-shot training. We release CCFQA as a foundational research resource to promote the development of MLLMs with more robust and reliable speech understanding capabilities. Our code and dataset are available at https://github.com/yxduir/ccfqa.