Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly

Machine learning has advanced dramatically, narrowing the accuracy gap to humans in multimodal tasks like visual question answering (VQA). However, while humans can say "I don't know" when they are uncertain (i.e., abstain from answering a question), this ability has been largely neglected in multimodal research, despite its importance for using VQA in real settings. In this work, we promote a problem formulation for reliable VQA, where we prefer abstention over providing an incorrect answer. We first enable abstention capabilities for several VQA models and analyze both their coverage (the portion of questions answered) and their risk (the error on that portion). To that end, we explore several abstention approaches. We find that although the best performing models achieve over 71% accuracy on the VQA v2 dataset, introducing the option to abstain by directly using a model's softmax scores limits them to answering less than 8% of the questions to achieve a low risk of error (i.e., 1%). This motivates us to utilize a multimodal selection function to directly estimate the correctness of the predicted answers, which we show can triple the coverage from, for example, 5.0% to 16.7% at 1% risk. While it is important to analyze both coverage and risk, these metrics trade off against each other, which makes comparing VQA models challenging. To address this, we also propose an Effective Reliability metric for VQA that places a larger cost on incorrect answers compared to abstentions. This new problem formulation, metric, and analysis for VQA provide the groundwork for building effective and reliable VQA models that have the self-awareness to abstain if and only if they don't know the answer.
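To make the coverage and risk definitions concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes binary per-question correctness (the VQA v2 accuracy used in practice is a soft score), a simple confidence threshold as the abstention rule, and a scoring scheme for the reliability metric (+1 for a correct answer, -cost for an incorrect answer, 0 for an abstention) that is only one plausible reading of "a larger cost on incorrect answers compared to abstentions". All function names and the toy data are hypothetical.

```python
from typing import List, Tuple


def coverage_and_risk(confidences: List[float], correct: List[bool],
                      threshold: float) -> Tuple[float, float]:
    """Answer only when confidence >= threshold; abstain otherwise."""
    answered = [c for conf, c in zip(confidences, correct) if conf >= threshold]
    # Coverage: fraction of all questions that receive an answer.
    coverage = len(answered) / len(correct)
    # Risk: error rate on the answered portion (0.0 if nothing is answered).
    risk = sum(1 for c in answered if not c) / len(answered) if answered else 0.0
    return coverage, risk


def max_coverage_at_risk(confidences: List[float], correct: List[bool],
                         target_risk: float) -> float:
    """Largest coverage over all thresholds whose risk stays within target_risk."""
    best = 0.0
    for t in sorted(set(confidences)):
        cov, risk = coverage_and_risk(confidences, correct, t)
        if risk <= target_risk:
            best = max(best, cov)
    return best


def effective_reliability(confidences: List[float], correct: List[bool],
                          threshold: float, cost: float = 1.0) -> float:
    """Assumed scoring: +1 correct answer, -cost wrong answer, 0 abstention."""
    total = 0.0
    for conf, c in zip(confidences, correct):
        if conf >= threshold:
            total += 1.0 if c else -cost
    return total / len(correct)


# Toy data (hypothetical): model confidence and whether the answer was right.
conf = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
right = [True, True, True, False, True, True, False, False]

print(coverage_and_risk(conf, right, threshold=0.75))
print(max_coverage_at_risk(conf, right, target_risk=0.25))
print(effective_reliability(conf, right, threshold=0.5, cost=10.0))
```

In this sketch, raising the threshold lowers risk at the expense of coverage, which is the trade-off that makes comparing models on the two metrics separately difficult. A single score that penalizes wrong answers more heavily than abstentions collapses that trade-off into one number.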
