When discussing the visual world, we humans regularly ask for clarification if we are confused, yet the commonplace requirement in visiolinguistic tasks such as Visual Dialog, VQA, and Referring Expression Comprehension is to force a decision based on a single, static language input. Since this assumption does not match human practice, we relax it and allow our model to request new language inputs to refine its prediction for a task. Through the exemplar task of referring expression comprehension, we formalize and motivate the problem, introduce an evaluation method, and propose \textit{Iterative Multiplication of Probabilities for Re-query Of Verbal Expressions} (IMPROVE) -- a re-query method that updates the model's prediction across multiple queries. We demonstrate IMPROVE on two different referring expression comprehension models and show that it can improve accuracy by up to 6.23\% without additional training or modifications to the model's architecture.
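The method's name suggests its core update rule: fuse the per-query probability distributions over candidate regions by element-wise multiplication. The sketch below illustrates that rule under stated assumptions -- it presumes a hypothetical \texttt{model\_probs} interface that returns a probability vector over a fixed set of region proposals for each expression, and the renormalization step is an illustrative choice, not necessarily the paper's exact formulation.

\begin{verbatim}
import numpy as np

def improve_requery(model_probs, expressions):
    """Fuse one model's predictions across re-queries by
    iterative multiplication of probabilities.

    model_probs(expr): hypothetical interface returning a 1-D
    probability vector over a fixed set of candidate regions.
    """
    fused = None
    for expr in expressions:
        p = np.asarray(model_probs(expr), dtype=float)
        # Element-wise product accumulates evidence from each query.
        fused = p if fused is None else fused * p
        # Renormalize so the fused scores remain a distribution.
        fused /= fused.sum()
    return int(np.argmax(fused))  # index of the predicted region
\end{verbatim}

Under this reading, a region must score well under every expression to remain the prediction, matching the intuition that each clarifying re-query should narrow the candidate set; note that no retraining or architectural change is involved, consistent with the claim above.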