We present a new method to learn video representations from large-scale unlabeled video data. Ideally, this representation will be generic and transferable, directly usable for new tasks such as action recognition and zero- or few-shot learning. We formulate unsupervised representation learning as a multi-modal, multi-task learning problem, where the representations are shared across different modalities via distillation. Further, we introduce the concept of loss function evolution, using an evolutionary search algorithm to automatically find an optimal combination of loss functions capturing many (self-supervised) tasks and modalities. Third, we propose an unsupervised representation evaluation metric based on Zipf's law, which uses distribution matching to a large unlabeled dataset as a prior constraint. This unsupervised constraint, which is not guided by any labeling, produces results similar to those of weakly supervised, task-specific ones. The proposed unsupervised representation learning yields a single RGB network and outperforms previous methods. Notably, it is also more effective than several label-based methods (e.g., ImageNet), with the exception of large, fully labeled video datasets.
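To make the loss-function-evolution idea concrete, here is a minimal sketch of an evolutionary search over weight vectors that combine several per-task self-supervised losses. It assumes a simple mutate-and-select loop; the candidate loss terms and the `fitness` stand-in are hypothetical placeholders, not the actual training procedure.

```python
# Minimal sketch: evolutionary search over combinations of loss functions.
# All names (loss terms, the fitness call) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

NUM_LOSSES = 4        # e.g., RGB, flow, audio, and distillation terms
POP_SIZE = 20
GENERATIONS = 10

def fitness(weights: np.ndarray) -> float:
    """Stand-in for: train briefly with this loss combination, then score
    the representation with an unsupervised metric (see the Zipf sketch
    below). A dummy quadratic is used here so the example runs."""
    target = np.array([0.5, 0.2, 0.2, 0.1])
    return -float(np.sum((weights - target) ** 2))

# Each individual is a normalized weight vector over the loss terms.
population = rng.random((POP_SIZE, NUM_LOSSES))
population /= population.sum(axis=1, keepdims=True)

for gen in range(GENERATIONS):
    scores = np.array([fitness(w) for w in population])
    # Keep the top half, refill with mutated copies of the survivors.
    survivors = population[np.argsort(scores)[-POP_SIZE // 2:]]
    children = survivors + rng.normal(0, 0.05, survivors.shape)
    children = np.clip(children, 1e-6, None)
    children /= children.sum(axis=1, keepdims=True)
    population = np.concatenate([survivors, children])

best = population[np.argmax([fitness(w) for w in population])]
print("best loss weights:", np.round(best, 3))
```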
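Similarly, the Zipf's-law evaluation metric can be sketched as follows: cluster the learned embeddings and measure how closely the rank-sorted cluster-size distribution matches a Zipf prior. The k-means clustering step and the KL-divergence score are assumptions made for illustration; the paper's metric may differ in detail.

```python
# Minimal sketch: unsupervised evaluation via distribution matching to a
# Zipf prior. Clustering choice and score direction are assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def zipf_prior(k: int, s: float = 1.0) -> np.ndarray:
    """Zipf's law: the i-th most frequent class has probability ~ 1 / i^s."""
    ranks = np.arange(1, k + 1, dtype=float)
    p = 1.0 / ranks ** s
    return p / p.sum()

def zipf_score(embeddings: np.ndarray, k: int = 50) -> float:
    """Lower is better: KL divergence between the empirical (rank-sorted)
    cluster distribution and the Zipf prior."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    counts = np.bincount(labels, minlength=k).astype(float)
    q = np.sort(counts)[::-1] / counts.sum()   # empirical, rank-sorted
    p = zipf_prior(k)
    eps = 1e-12
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Dummy embeddings standing in for features from the learned video network.
feats = rng.normal(size=(2000, 128))
print("Zipf KL score:", round(zipf_score(feats), 4))
```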