We consider the task of temporal human action localization in lifestyle vlogs. We introduce a novel dataset consisting of manual temporal localization annotations for 13,000 narrated actions in 1,200 video clips. We present an extensive analysis of this data, which allows us to better understand how the language and visual modalities interact throughout the videos. We propose a simple yet effective method that localizes narrated actions based on their expected duration. Through several experiments and analyses, we show that our method provides information complementary to previous methods and improves over prior work on the task of temporal action localization.
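To make the duration-based idea concrete, below is a minimal sketch of how one might predict a temporal window from a narration timestamp and an expected action duration. All names here (`NarratedAction`, `EXPECTED_DURATION`, the centering heuristic, and the example durations) are illustrative assumptions, not the paper's actual implementation; in practice, expected durations would be estimated from annotated training data.

```python
from dataclasses import dataclass

@dataclass
class NarratedAction:
    """A narrated action with the time its mention appears in the transcript."""
    name: str
    narration_time: float  # seconds from the start of the clip

# Hypothetical per-action expected durations (seconds); these would be
# estimated from annotated data rather than hard-coded.
EXPECTED_DURATION = {"chop onions": 20.0, "pour water": 5.0}
DEFAULT_DURATION = 10.0  # fallback for actions without an estimate

def localize(action: NarratedAction) -> tuple[float, float]:
    """Predict a (start, end) window for the action.

    Assumes the action roughly co-occurs with its narration and spans its
    expected duration -- a simplification of the duration-based approach.
    """
    d = EXPECTED_DURATION.get(action.name, DEFAULT_DURATION)
    start = max(0.0, action.narration_time - d / 2)
    return start, start + d

print(localize(NarratedAction("chop onions", narration_time=42.0)))
# -> (32.0, 52.0)
```

Centering the window on the narration time is only one possible alignment choice; narrations may also precede or follow the action, which is the kind of language-vision interaction the dataset analysis is meant to quantify.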