We present a multiview pseudo-labeling approach to video learning, a novel framework that uses complementary views, in the form of appearance and motion information, for semi-supervised learning in video. The complementary views help obtain more reliable pseudo-labels on unlabeled video, enabling stronger video representations to be learned than from supervised data alone. Though our method capitalizes on multiple views, it nonetheless trains a single model that is shared across appearance and motion input and thus, by design, incurs no additional computational overhead at inference time. On multiple video recognition datasets, our method substantially outperforms its supervised counterpart, and compares favorably to previous work on standard benchmarks in self-supervised video representation learning.
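The core idea can be illustrated with a minimal sketch of multiview pseudo-labeling: a shared model scores both the appearance (RGB) and motion (e.g. optical flow) views of an unlabeled clip, the per-view predictions are combined, and only confident predictions are kept as pseudo-labels. The function name, the simple averaging rule, and the confidence threshold below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def multiview_pseudo_labels(p_rgb, p_flow, threshold=0.8):
    """Combine per-view class probabilities into pseudo-labels.

    p_rgb, p_flow: (N, C) softmax outputs from the shared model applied
    to the appearance and motion views of the same N unlabeled clips.
    Returns (labels, mask): hard pseudo-labels and a reliability mask.
    Averaging and thresholding here are assumed, simplified choices.
    """
    p = (p_rgb + p_flow) / 2.0   # ensemble the complementary views
    conf = p.max(axis=1)         # confidence of the ensembled prediction
    labels = p.argmax(axis=1)    # hard pseudo-labels
    mask = conf >= threshold     # keep only reliable pseudo-labels
    return labels, mask
```

Only the clips selected by `mask` would then contribute a classification loss with their pseudo-labels, alongside the ordinary supervised loss on labeled data.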