We address the problem of retrieving a specific moment from an untrimmed video via a natural language query. It is a challenging problem because a target moment may take place in the context of other temporal moments in the untrimmed video. Existing methods cannot tackle this challenge well, since they do not fully consider the temporal contexts between moments. In this paper, we model the temporal context between video moments by a set of predefined two-dimensional maps under different temporal scales. For each map, one dimension indicates the starting time of a moment and the other indicates its duration. These 2D temporal maps can cover diverse video moments of different lengths, while representing their adjacent contexts at different temporal scales. Based on the 2D temporal maps, we propose a Multi-Scale Temporal Adjacent Network (MS-2D-TAN), a single-shot framework for moment localization. It is capable of encoding the adjacent temporal contexts at each scale, while learning discriminative features for matching video moments with referring expressions. We evaluate the proposed MS-2D-TAN on three challenging benchmarks, i.e., Charades-STA, ActivityNet Captions, and TACoS, where our MS-2D-TAN outperforms the state of the art.
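To make the 2D temporal map concrete, here is a minimal sketch (our own illustration, not the authors' released code) of how candidate moments of a video can be indexed on a map whose two axes are starting time and duration. We assume the video is divided into `n` clips with precomputed features, and represent each moment by mean-pooling the features of the clips it spans; the function name and pooling choice are illustrative assumptions.

```python
import numpy as np

def build_2d_temporal_map(clip_features):
    """Index candidate moments on a 2D map of (start index, duration index).

    Cell (i, j) holds the moment spanning clips i .. i + j. Cells with
    i + j >= n fall outside the video and stay masked as invalid.
    This is a hypothetical sketch of the data structure, not MS-2D-TAN itself.
    """
    n, d = clip_features.shape
    moment_map = np.zeros((n, n, d))        # axes: [start, duration, feature]
    valid = np.zeros((n, n), dtype=bool)    # mask of moments inside the video
    for start in range(n):
        for dur in range(n - start):
            # Represent the moment by mean-pooling its clip features
            moment_map[start, dur] = clip_features[start:start + dur + 1].mean(axis=0)
            valid[start, dur] = True
    return moment_map, valid

feats = np.random.rand(8, 16)               # 8 clips, 16-dim clip features
moment_map, valid = build_2d_temporal_map(feats)
# valid[i, j] is True only for moments that fit inside the video;
# adjacent cells on the map are temporally adjacent moments, which is
# what lets a network encode adjacent temporal context. Multiple scales
# correspond to building such maps over clips of different granularities.
```

Because neighboring cells correspond to moments with nearby start times and durations, convolutions over this map can aggregate each moment's adjacent temporal context, which is the intuition the abstract describes.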