Understanding the structure of complex activities in untrimmed videos is a challenging task in the area of action recognition. One problem is that this task usually requires a large amount of hand-annotated minute- or even hour-long video data, but annotating such data is very time consuming and cannot easily be automated or scaled. To address this problem, this paper proposes an approach for the unsupervised learning of actions in untrimmed video sequences based on a joint visual-temporal embedding space. To this end, we combine a visual embedding based on a predictive U-Net architecture with a continuous temporal embedding function. The resulting representation space allows the detection of relevant action clusters based on their visual as well as their temporal appearance. The proposed method is evaluated on three standard benchmark datasets: Breakfast Actions, INRIA YouTube Instructional Videos, and 50 Salads. We show that the proposed approach is able to provide a meaningful visual and temporal embedding from the visual cues present in contiguous video frames and is suitable for the task of unsupervised temporal segmentation of actions.
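The following is a minimal, illustrative sketch of the general idea described above, not the authors' implementation: per-frame visual features from a small U-Net-style encoder are concatenated with a normalized, continuous frame timestamp, and the resulting joint embedding is clustered to obtain candidate action groups. All module and function names (e.g. `TinyUNetEncoder`, `joint_embedding`, `time_weight`) are assumptions introduced here for illustration; the paper's predictive U-Net and temporal model are more involved.

```python
# Hypothetical sketch: joint visual-temporal embedding followed by clustering.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans


class TinyUNetEncoder(nn.Module):
    """Downsampling half of a U-Net-like network, used only to produce frame embeddings."""

    def __init__(self, in_channels=3, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global pooling -> one vector per frame
            nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, frames):         # frames: (T, C, H, W)
        return self.net(frames)        # (T, dim)


def joint_embedding(frames, encoder, time_weight=1.0):
    """Concatenate visual features with a normalized frame timestamp in [0, 1]."""
    with torch.no_grad():
        visual = encoder(frames)                               # (T, dim)
    t = torch.linspace(0, 1, frames.shape[0]).unsqueeze(1)     # (T, 1) continuous time
    return torch.cat([visual, time_weight * t], dim=1)         # (T, dim + 1)


if __name__ == "__main__":
    video = torch.rand(200, 3, 64, 64)   # 200 dummy RGB frames of an untrimmed sequence
    emb = joint_embedding(video, TinyUNetEncoder()).numpy()
    labels = KMeans(n_clusters=5, n_init=10).fit_predict(emb)  # candidate action clusters
    print(labels[:20])
```

In this sketch the temporal coordinate simply biases frames that are close in time toward the same cluster, which is the basic intuition behind combining visual and temporal cues in one embedding space.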