The majority of traditional text-to-video retrieval systems operate in static environments, i.e., there is no interaction between the user and the agent beyond the initial textual query provided by the user. This can be suboptimal if the initial query is ambiguous, which can lead to many incorrectly retrieved videos. To overcome this limitation, we propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog, where the user refines the retrieved results by answering questions generated by the AI agent. Our novel multimodal question generator uses (i) the video candidates retrieved during the last round of interaction with the user and (ii) the text-based dialog history documenting all previous interactions to generate questions that incorporate both visual and linguistic cues relevant to subsequent video retrieval. Furthermore, to generate maximally informative questions, we propose Information-Guided Supervision (IGS), which guides the question generator to ask questions that would boost subsequent video retrieval accuracy. We validate the effectiveness of our interactive ViReD framework on the AVSD dataset, showing that our interactive method performs significantly better than traditional non-interactive video retrieval systems. We also demonstrate that our proposed approach generalizes to real-world settings involving interactions with real humans, demonstrating the robustness and generality of our framework.
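The interactive loop described above can be sketched in a few lines of code. The sketch below is purely illustrative: it stands in for the paper's learned multimodal models with a toy bag-of-words retriever over text captions, and the `ask`/`answer` callables are hypothetical stand-ins for the question generator and the human user. It shows only the control flow: retrieve candidates, generate a question conditioned on the candidates and dialog history, append the answer, and re-retrieve.

```python
# Minimal sketch of a ViReD-style interactive retrieval loop. All names and the
# bag-of-words scorer are illustrative assumptions, not the paper's method.
import math
from collections import Counter

def score(query, caption):
    """Toy bag-of-words cosine similarity standing in for a learned video-text model."""
    q, c = Counter(query.lower().split()), Counter(caption.lower().split())
    dot = sum(q[w] * c[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in c.values()))
    return dot / norm if norm else 0.0

def retrieve(query, videos, top_k=2):
    """Rank candidate videos (represented here by text captions) against the query."""
    return sorted(videos, key=lambda v: score(query, v), reverse=True)[:top_k]

def interactive_retrieval(initial_query, videos, ask, answer, rounds=2):
    """Each round: retrieve with the full dialog history as context, ask a question
    conditioned on the current candidates, and append the user's answer."""
    dialog = [initial_query]
    for _ in range(rounds):
        candidates = retrieve(" ".join(dialog), videos)
        question = ask(candidates, dialog)   # question generator (stubbed)
        dialog += [question, answer(question)]  # user's answer refines the query
    return retrieve(" ".join(dialog), videos, top_k=1)[0]
```

For example, an ambiguous query like "a man cooking" cannot distinguish between several cooking videos on its own, but an answer such as "indoors in a kitchen" added to the dialog history disambiguates the ranking on the next round.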