This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the region with the largest color diversity along the temporal axis, etc. A neural network is then built and trained to predict these statistical summaries given the video frames as input. To alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing contents in the visual field, and needs only a rough impression of spatial locations to understand visual content. To validate the effectiveness of the proposed approach, we conduct extensive experiments with four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D and S3D-G. The results show that our approach outperforms existing approaches across these backbone networks on four downstream video analysis tasks including action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at: https://github.com/laura-wang/video_repres_sts.
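To make the pretext labels concrete, the sketch below illustrates one way such a motion summary could be computed for a clip: partition each frame into a coarse grid, accumulate optical-flow magnitude per block, and report the block with the largest motion together with its dominant flow direction. This is a minimal sketch only; the 4x4 grid, the Farneback flow estimator, and the 8-bin direction quantization are assumptions for illustration and may differ from the paper's actual spatial partitioning patterns and motion statistics.

```python
import numpy as np
import cv2

def motion_statistics_label(frames, grid=(4, 4)):
    """Return (block_index, direction_bin) for a clip of shape (T, H, W, 3), uint8 RGB.

    block_index: index of the spatial block with the largest accumulated motion.
    direction_bin: dominant flow direction in that block, quantized into 8 bins.
    Illustrative sketch only; grid size and flow method are assumptions.
    """
    T, H, W, _ = frames.shape
    gh, gw = grid
    mag_sum = np.zeros(grid)            # accumulated flow magnitude per block
    ang_hist = np.zeros(grid + (8,))    # magnitude-weighted direction histogram per block

    prev = cv2.cvtColor(frames[0], cv2.COLOR_RGB2GRAY)
    for t in range(1, T):
        curr = cv2.cvtColor(frames[t], cv2.COLOR_RGB2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])  # ang in radians
        for i in range(gh):
            for j in range(gw):
                ys = slice(i * H // gh, (i + 1) * H // gh)
                xs = slice(j * W // gw, (j + 1) * W // gw)
                m, a = mag[ys, xs], ang[ys, xs]
                mag_sum[i, j] += m.sum()
                bins = (a / (2 * np.pi) * 8).astype(int) % 8
                np.add.at(ang_hist[i, j], bins.ravel(), m.ravel())
        prev = curr

    i, j = np.unravel_index(mag_sum.argmax(), mag_sum.shape)
    block_idx = i * gw + j                      # rough spatial location label
    direction = int(ang_hist[i, j].argmax())    # dominant motion direction bin
    return block_idx, direction
```

In a self-supervised setup, labels of this kind would be generated on the fly from unlabeled clips and regressed or classified by the 3D backbone, with no manual annotation involved.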