Generating engaging content has drawn much recent attention in the NLP community. Asking questions is a natural way to respond to photos and to promote engagement. However, most answers to questions in traditional question-answering (QA) datasets are factoids, which reduces individuals' willingness to answer. Furthermore, traditional visual question generation (VQG) confines the source data for question generation to single images, resulting in a limited ability to comprehend time-series information about the underlying event. In this paper, we propose generating engaging questions from multiple images. We present MVQG, a new dataset, and establish a series of baselines, including both end-to-end and dual-stage architectures. Results show that building stories behind the image sequence enables models to generate engaging questions, which confirms our assumption that people typically construct a picture of the event in their minds before asking questions. These results open up an exciting challenge for vision-and-language models: to implicitly construct a story behind a series of photos that allows for creativity and experience sharing, and hence to draw attention to downstream applications.