This paper presents the 5th place solution by our team, y3h2, for the Meta CRAG-MM Challenge at KDD Cup 2025. The CRAG-MM benchmark is a visual question answering (VQA) dataset focused on factual questions about images, including egocentric images. Submissions were ranked by VQA accuracy as judged by an LLM-based automatic evaluator. Since incorrect answers incur negative scores, our strategy focused on detecting and suppressing hallucinations using the VLM's internal representations. Specifically, we trained logistic regression-based hallucination detection models on both the hidden_state and the outputs of specific attention heads, and then employed an ensemble of these models. As a result, while our method sacrificed some correct answers, it significantly reduced hallucinations and allowed us to place among the top entries on the final leaderboard.
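The probe-and-ensemble idea described above can be sketched as follows. This is a minimal illustration on synthetic features, not the actual competition pipeline: extraction of the real hidden_state and attention-head features from the VLM is omitted, and all function names and hyperparameters here are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_probe(X, y, lr=0.1, epochs=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        grad = sigmoid(X @ w + b) - y          # dL/dlogit for log loss
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

def ensemble_score(feature_views, probes):
    """Average hallucination probabilities from probes on different feature views."""
    return np.mean(
        [sigmoid(X @ w + b) for X, (w, b) in zip(feature_views, probes)],
        axis=0,
    )

# Toy demo: two synthetic "views" standing in for hidden-state features and
# attention-head outputs; label 1 = hallucinated answer, label 0 = grounded.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
views = [
    np.vstack([rng.normal(-1.5, 0.7, (50, 8)), rng.normal(1.5, 0.7, (50, 8))])
    for _ in range(2)
]
probes = [train_probe(X, y) for X in views]
scores = ensemble_score(views, probes)
# Answers whose ensemble score exceeds a threshold would be replaced
# by an abstention such as "I don't know" to avoid the negative-score penalty.
predictions = scores > 0.5
```

In this framing the threshold trades correct answers against hallucinations, which matches the paper's observation that some correct answers are sacrificed to reduce the penalty from wrong ones.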