Multimodal reasoning with vision-language models (VLMs) often suffers from hallucinations, as models tend to generate explanations after only a superficial inspection of the image. We present \textbf{CoRGI} (\textbf{C}hain \textbf{o}f \textbf{R}easoning with \textbf{G}rounded \textbf{I}nsights), a framework that enhances reasoning reliability through post-hoc verification of chain-of-thought outputs. Given a VLM-generated rationale, CoRGI decomposes it into step-wise statements, grounds each step in visual evidence, and filters or corrects unsupported claims before producing the final answer. Experiments on five challenging benchmarks (VCR, ScienceQA, MMMU, MathVista, and HallusionBench) demonstrate that CoRGI consistently improves both answer accuracy and explanation faithfulness across multiple VLM backbones, including Qwen-2.5VL, LLaVA-1.6, and Gemma3-12B. Beyond these quantitative gains, qualitative analyses further illustrate how the verification process reduces hallucination and strengthens interpretability, suggesting that post-hoc visual grounding is a promising direction for building more trustworthy and transparent multimodal reasoning systems.
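To make the decompose--ground--filter pipeline concrete, the following is a minimal sketch of a CoRGI-style verification loop. It is an illustration under stated assumptions, not the implementation used in the paper: the functions \texttt{generate\_rationale}, \texttt{ground\_step}, and \texttt{is\_supported} are hypothetical placeholders stubbed so the sketch runs end to end; in practice they would be backed by the VLM and a visual-grounding module.
\begin{verbatim}
from dataclasses import dataclass

@dataclass
class Step:
    claim: str           # one step-wise statement from the chain of thought
    evidence: str = ""   # visual evidence retrieved for this claim
    supported: bool = False

# Hypothetical model calls, stubbed for illustration only.
def generate_rationale(question, image):
    return "The sign is red. The sign says STOP."

def ground_step(claim, image):
    return f"image region consistent with: {claim}"  # stand-in for visual grounding

def is_supported(claim, evidence):
    return bool(evidence)                            # stand-in for an entailment check

def decompose(rationale):
    """Split the rationale into step-wise statements."""
    return [Step(s.strip()) for s in rationale.split(".") if s.strip()]

def corgi_answer(question, image):
    steps = decompose(generate_rationale(question, image))
    for step in steps:                               # ground each step in the image
        step.evidence = ground_step(step.claim, image)
        step.supported = is_supported(step.claim, step.evidence)
    verified = [s for s in steps if s.supported]     # filter unsupported claims
    return "; ".join(s.claim for s in verified)      # answer from verified steps only

print(corgi_answer("What does the sign say?", image=None))
\end{verbatim}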