Visual Question Answering (VQA) is a novel problem domain in which multi-modal inputs must be processed to solve a task posed in natural language. Because solutions inherently require combining visual and natural language processing with abstract reasoning, the problem is considered AI-complete. Recent advances indicate that using high-level, abstract facts extracted from the inputs might facilitate reasoning. Following that direction, we developed a solution combining state-of-the-art object detection and reasoning modules. The results, achieved on the well-balanced CLEVR dataset, confirm this promise and show significant accuracy improvements of a few percent on the complex "counting" task.