Temporal language grounding in videos aims to localize the temporal span relevant to a given query sentence. Previous methods treat it either as a boundary regression task or a span extraction task. This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it. This framework selects a video moment choice from a predefined answer set with the aid of coarse-and-fine choice-query interaction and choice-choice relation construction. A choice-query interactor is proposed to match the visual and textual information simultaneously at the sentence-moment and token-moment levels, leading to a coarse-and-fine cross-modal interaction. Moreover, a novel multi-choice relation constructor is introduced, leveraging graph convolution to capture the dependencies among video moment choices for best-choice selection. Extensive experiments on ActivityNet-Captions, TACoS, and Charades-STA demonstrate the effectiveness of our solution. Code will be released soon.
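To make the multi-choice relation constructor concrete, the following is a minimal sketch of one graph-convolution step over moment choices, assuming each candidate moment has already been encoded as a query-aware feature vector. The class name, the similarity-based soft adjacency, and all shapes are illustrative assumptions for exposition, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiChoiceRelationConstructor(nn.Module):
    """Hypothetical sketch: one GCN-style layer over moment choices.

    Each candidate moment is a graph node; a learned soft adjacency
    (here, row-normalized pairwise feature similarity) lets choices
    exchange information before the best choice is scored.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, choices: torch.Tensor) -> torch.Tensor:
        # choices: (batch, num_moments, dim) query-aware moment features.
        # Soft adjacency from pairwise similarity, normalized per row.
        adj = F.softmax(torch.bmm(choices, choices.transpose(1, 2)), dim=-1)
        # GCN step: aggregate neighbor features, then linearly transform.
        out = self.proj(torch.bmm(adj, choices))
        # Residual connection preserves each choice's original evidence.
        return F.relu(out) + choices


# Usage sketch: relate 16 candidate moments with 256-d features,
# then pick the highest-scoring choice with a stand-in scorer.
relator = MultiChoiceRelationConstructor(dim=256)
moments = torch.randn(2, 16, 256)   # (batch, num_moments, dim)
related = relator(moments)          # (2, 16, 256)
scores = related.sum(-1)            # placeholder scoring head
best = scores.argmax(dim=-1)        # index of the selected moment choice
```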