We introduce a weakly supervised method for representation learning based on aligning temporal sequences (e.g., videos) of the same process (e.g., human action). The main idea is to use the global temporal ordering of latent correspondences across sequence pairs as a supervisory signal. In particular, we propose a loss based on scoring the optimal sequence alignment to train an embedding network. Our loss is based on a novel probabilistic path-finding view of dynamic time warping (DTW) with three key features: (i) the local path routing decisions are contrastive and differentiable, (ii) pairwise distances are cast as probabilities that are likewise contrastive, and (iii) the formulation naturally admits a global cycle consistency loss that verifies correspondences. For evaluation, we consider the tasks of fine-grained action classification, few-shot learning, and video synchronization, and report significant performance improvements over previous methods. In addition, we present two applications of our temporal alignment framework, namely 3D pose reconstruction and fine-grained audio/visual retrieval.
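To make the alignment-scoring idea concrete, the sketch below illustrates a generic smoothed-DTW score over contrastively normalized frame distances: pairwise costs are obtained as negative log-probabilities of a softmax over candidate matches, and the DTW recursion uses a soft-min so the score is differentiable end-to-end. This is a minimal illustration under our own assumptions (the function names, the `temperature` and `gamma` parameters, and the soft-min recursion are generic choices), not the paper's exact loss or its cycle consistency term.

```python
import numpy as np

def contrastive_distance_matrix(x, y, temperature=0.1):
    """Cast pairwise frame distances as probabilities: each frame of x is
    softmax-matched against all frames of y, so a correct match is
    contrasted with every alternative (illustrative choice of local cost)."""
    # x: (T1, D), y: (T2, D) L2-normalized frame embeddings
    sim = x @ y.T                              # cosine similarities, (T1, T2)
    logits = sim / temperature
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return -np.log(probs + 1e-9)               # local matching cost, (T1, T2)

def soft_dtw_score(cost, gamma=0.1):
    """Smoothed DTW alignment score: the hard min over the three admissible
    path moves (match, insertion, deletion) is replaced by a soft-min,
    making the optimal-alignment score differentiable in the costs."""
    T1, T2 = cost.shape
    R = np.full((T1 + 1, T2 + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            prev = np.array([R[i - 1, j - 1], R[i - 1, j], R[i, j - 1]])
            # soft-min via -gamma * logsumexp(-prev / gamma), computed stably
            m = prev.min()
            soft_min = m - gamma * np.log(np.exp(-(prev - m) / gamma).sum())
            R[i, j] = cost[i - 1, j - 1] + soft_min
    return R[T1, T2]

# Toy usage: score the alignment of two short embedded sequences.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16)); x /= np.linalg.norm(x, axis=1, keepdims=True)
y = rng.normal(size=(10, 16)); y /= np.linalg.norm(y, axis=1, keepdims=True)
print(soft_dtw_score(contrastive_distance_matrix(x, y)))
```

In a training setting, such a score would be computed on the embeddings produced by the network for a pair of videos of the same action and minimized by backpropagation; the paper's formulation additionally verifies correspondences with a global cycle consistency loss.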