In most existing Visual Question Answering (VQA) research, the answers are short, often single words, following instructions given to the annotators during dataset construction. This study envisions a VQA task for natural situations, where the answers are more likely to be sentences than single words. To bridge the gap between this natural VQA and existing VQA approaches, a novel unsupervised keyword extraction method is proposed. The method is based on the principle that a full-sentence answer can be decomposed into two parts: one containing the new information that answers the question (i.e., the keywords), and one containing information already present in the question. Discriminative decoders were designed to achieve this decomposition, and the method was evaluated experimentally on VQA datasets containing full-sentence answers. The results show that the proposed model can accurately extract the keywords without being given explicit annotations describing them.
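The decomposition principle above can be illustrated with a minimal sketch. The paper's actual method learns the split with discriminative decoders in an unsupervised fashion; the set-difference heuristic below (a hypothetical baseline, not the proposed model) only shows what the two parts of a full-sentence answer look like: tokens that overlap the question carry already-known information, while the remaining tokens are keyword candidates.

```python
# Naive illustration of decomposing a full-sentence answer into
# question-overlap words and keyword candidates. This is a hedged
# sketch, not the paper's discriminative-decoder method.

def split_answer(question: str, answer: str):
    """Split answer tokens into (question-overlap words, keyword candidates)."""
    q_tokens = set(question.lower().rstrip("?.").split())
    known, keywords = [], []
    for tok in answer.rstrip(".").split():
        # Tokens already present in the question convey no new information.
        (known if tok.lower() in q_tokens else keywords).append(tok)
    return known, keywords

known, keywords = split_answer(
    "What is the man holding?",
    "The man is holding a red umbrella.",
)
# "umbrella" ends up among the keyword candidates, since it does
# not appear in the question.
```

A real system needs more than exact string overlap (inflection, synonyms, function words such as "a" above), which is precisely why the paper resorts to learned decoders rather than a lexical rule.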