In the context of Visual Question Answering (VQA) and Agentic AI, calibration refers to how closely an AI system's confidence in its answers reflects their actual correctness. Calibration becomes especially important when such systems operate autonomously and must make decisions under visual uncertainty. While modern VQA systems, powered by advanced vision-language models (VLMs), are increasingly deployed in high-stakes domains such as medical diagnostics and autonomous navigation due to their improved accuracy, the reliability of their confidence estimates remains under-examined; in particular, these systems often produce overconfident responses. To address this, we introduce AlignVQA, a debate-based multi-agent framework in which diverse specialized VLM agents -- each following a distinct prompting strategy -- generate candidate answers and then engage in a two-stage interaction in which generalist agents critique, refine, and aggregate these proposals. This debate process yields confidence estimates that more accurately reflect the model's true predictive performance, and we find that better-calibrated specialized agents produce better-aligned aggregate confidences. Furthermore, we introduce aligncal, a novel differentiable calibration-aware loss function designed to fine-tune the specialized agents by minimizing an upper bound on the calibration error, explicitly improving the fidelity of each agent's confidence estimates. Empirical results across multiple benchmark VQA datasets substantiate the efficacy of our approach, demonstrating substantial reductions in calibration error.
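The abstract does not define the loss itself; for intuition only, the sketch below shows a generic differentiable calibration-aware objective of the kind described: a soft-binned surrogate for the Expected Calibration Error (ECE), written in PyTorch. The function name, soft-binning scheme, and temperature parameter are illustrative assumptions, not the paper's actual aligncal formulation.

```python
import torch
import torch.nn.functional as F

def soft_binned_ece(logits, labels, n_bins=10, temperature=0.01):
    """Differentiable surrogate for the Expected Calibration Error (ECE).

    Hard binning of confidences is non-differentiable, so each sample is
    assigned to every bin with soft (softmax) weights instead. This is a
    generic calibration-aware objective, not the paper's aligncal loss.
    """
    probs = F.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)              # per-sample confidence and prediction
    correct = (pred == labels).float()          # per-sample correctness indicator

    # Evenly spaced bin centers in (0, 1).
    centers = torch.linspace(0.5 / n_bins, 1 - 0.5 / n_bins, n_bins,
                             device=logits.device)

    # Soft assignment of each confidence to every bin center.
    dist = -(conf.unsqueeze(1) - centers.unsqueeze(0)) ** 2   # (batch, n_bins)
    weights = F.softmax(dist / temperature, dim=1)

    bin_mass = weights.sum(dim=0) + 1e-8                       # soft count per bin
    bin_conf = (weights * conf.unsqueeze(1)).sum(dim=0) / bin_mass
    bin_acc = (weights * correct.unsqueeze(1)).sum(dim=0) / bin_mass

    # Bin-mass-weighted |accuracy - confidence| gap, as in ECE.
    return ((bin_mass / conf.numel()) * (bin_acc - bin_conf).abs()).sum()

# Example usage when fine-tuning an agent head (weighting is illustrative):
# loss = F.cross_entropy(logits, labels) + 0.5 * soft_binned_ece(logits, labels)
```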