Neural module networks (NMN) have achieved success in image-grounded tasks such as Visual Question Answering (VQA) on synthetic images. However, NMNs have received very limited attention in video-grounded dialogue tasks. These tasks extend the complexity of traditional visual tasks with additional visual temporal variance and cross-turn language dependencies. Motivated by recent NMN approaches on image-grounded tasks, we introduce Video-grounded Neural Module Network (VGNMN) to model the information retrieval process in video-grounded language tasks as a pipeline of neural modules. VGNMN first decomposes all language components in a dialogue to explicitly resolve any entity references and detect corresponding action-based inputs from the question. The detected entities and actions are then used as parameters to instantiate a neural module network and extract visual cues from the video. Our experiments show that VGNMN achieves promising performance on a challenging video-grounded dialogue benchmark as well as a video QA benchmark.
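To make the described pipeline concrete, below is a minimal sketch (not the authors' implementation) of how detected entities and actions could parameterize a small chain of neural modules over video features. The module names (FindEntity, FindAction, Describe), the dataflow, and all shapes are illustrative assumptions, not the modules used in VGNMN.

```python
# Illustrative sketch only: module names, shapes, and composition order are
# assumptions, not the actual VGNMN modules.
import torch
import torch.nn as nn


class FindEntity(nn.Module):
    """Attend over video frames conditioned on a detected entity embedding."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, video_feats, entity_emb):
        # video_feats: (T, dim), entity_emb: (dim,)
        scores = self.score(video_feats * entity_emb).squeeze(-1)
        return torch.softmax(scores, dim=0)  # (T,) attention over frames


class FindAction(nn.Module):
    """Refine a frame attention map conditioned on a detected action embedding."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, video_feats, prev_att, action_emb):
        scores = self.score(video_feats * action_emb).squeeze(-1)
        return torch.softmax(scores + prev_att.log(), dim=0)


class Describe(nn.Module):
    """Summarize the attended video features into a single visual cue vector."""
    def forward(self, video_feats, att):
        return att.unsqueeze(0) @ video_feats  # (1, dim)


def run_pipeline(video_feats, entity_emb, action_emb, dim=256):
    # The detected entity and action act as parameters of the instantiated modules.
    find_entity, find_action, describe = FindEntity(dim), FindAction(dim), Describe()
    att = find_entity(video_feats, entity_emb)
    att = find_action(video_feats, att, action_emb)
    return describe(video_feats, att)  # visual cue passed to an answer decoder


if __name__ == "__main__":
    T, dim = 30, 256                       # 30 video frames, 256-d features
    video = torch.randn(T, dim)
    entity, action = torch.randn(dim), torch.randn(dim)
    print(run_pipeline(video, entity, action, dim).shape)  # torch.Size([1, 256])
```

In this sketch the entity module grounds "where to look" across frames and the action module narrows that attention temporally; the final module pools the attended features into the visual cue the abstract refers to.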