Questions that require counting a variety of objects in images remain a major challenge in visual question answering (VQA). The most common approaches to VQA involve either classifying answers based on fixed-length representations of both the image and the question, or summing fractional counts estimated from each section of the image. In contrast, we treat counting as a sequential decision process and force our model to make discrete choices of what to count. Specifically, the model sequentially selects from detected objects and learns interactions between objects that influence subsequent selections. A distinguishing feature of our approach is its intuitive and interpretable output, as discrete counts are automatically grounded in the image. Furthermore, our method outperforms the state-of-the-art architecture for VQA on multiple metrics that evaluate counting.
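To make the idea of sequential, discrete selection concrete, the following is a minimal toy sketch and not the model described here. It assumes hand-written per-object scores, a hand-written interaction matrix, and a fixed-score "stop" action; the names `sequential_count`, `scores`, `interaction`, and `stop_score` are hypothetical, and in the actual approach these quantities would be learned from the image and the question.

```python
def sequential_count(scores, interaction, stop_score=0.0):
    """Greedily pick objects one at a time and return the chosen indices.

    scores:      initial relevance score of each detected object (assumed input).
    interaction: interaction[i][j] is added to object j's score once
                 object i has been selected (assumed input).
    stop_score:  score of the terminal "stop counting" action (assumption).
    """
    current = list(scores)            # working copy, updated after each pick
    remaining = set(range(len(current)))
    selected = []
    while remaining:
        # Pick the highest-scoring unselected object (lowest index on ties).
        best = max(sorted(remaining), key=lambda j: current[j])
        if current[best] <= stop_score:
            break                     # the stop action wins; counting ends
        selected.append(best)
        remaining.remove(best)
        # Selecting an object changes the scores of the objects still left.
        for j in remaining:
            current[j] += interaction[best][j]
    return selected


# Objects 0 and 1 are overlapping detections of the same thing, so picking
# one suppresses the other; object 3 is irrelevant to the question.
scores = [2.0, 1.5, 1.2, -1.0]
interaction = [
    [0.0, -2.0, 0.0, 0.0],
    [-2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
chosen = sequential_count(scores, interaction)
print(len(chosen), chosen)
```

The count is the number of selected objects, and the selected indices point back to specific detections, which is the sense in which a discrete count is grounded in the image.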