Many tasks in video analysis and understanding boil down to frame-based feature learning, which aims to encapsulate the relevant visual content so that subsequent processing becomes simpler and easier. While supervised strategies for this learning task can be envisioned, self- and weakly-supervised alternatives are preferred due to the difficulty of obtaining labeled data. This paper introduces LRProp -- a novel weakly-supervised representation learning approach, with an emphasis on temporal alignment between pairs of videos of the same action category. The proposed approach uses a transformer encoder to extract frame-level features, and employs the DTW algorithm within the training iterations in order to identify the alignment path between video pairs. Through a process referred to as ``pair-wise position propagation'', the per-location probability distributions of these correspondences are matched with the similarity of the frame-level features via KL-divergence minimization. The proposed algorithm also uses a regularized SoftDTW loss for better tuning of the learned features. Our novel representation learning paradigm consistently outperforms the state of the art on temporal alignment tasks, establishing a new performance bar over several downstream video analysis applications.
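The core training signal described above -- aligning two feature sequences with DTW and then matching the resulting correspondence distributions to frame-similarity softmaxes via KL divergence -- can be sketched as follows. This is a minimal NumPy illustration under simplifying assumptions (a hard DTW path converted to uniform per-frame target distributions, a hypothetical `temperature` parameter); it is not the paper's actual implementation, which uses SoftDTW and pair-wise position propagation on learned transformer features.

```python
import numpy as np

def dtw_path(cost):
    """Classic DTW: cumulative cost table plus a backtracked alignment path."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # Backtrack from the end to recover the optimal monotone path.
    path = [(n - 1, m - 1)]
    i, j = n, m
    while (i, j) != (1, 1):
        steps = {(i - 1, j): acc[i - 1, j],
                 (i, j - 1): acc[i, j - 1],
                 (i - 1, j - 1): acc[i - 1, j - 1]}
        i, j = min(steps, key=steps.get)
        path.append((i - 1, j - 1))
    return path[::-1]

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_alignment_loss(feats_a, feats_b, temperature=0.1):
    """Toy alignment loss (an assumption, not the paper's exact loss):
    KL between DTW-derived target distributions over frames of B and the
    softmax of frame-level feature similarities."""
    sim = feats_a @ feats_b.T          # (n, m) frame similarity matrix
    path = dtw_path(-sim)              # high similarity = low cost
    n, m = sim.shape
    # Target: for each frame of A, a uniform distribution over its
    # DTW-matched frames of B (hard correspondences, for illustration).
    target = np.zeros((n, m))
    for i, j in path:
        target[i, j] = 1.0
    target /= target.sum(axis=1, keepdims=True)
    pred = softmax(sim / temperature)
    eps = 1e-9
    kl = np.sum(target * (np.log(target + eps) - np.log(pred + eps)))
    return float(kl / n)               # mean per-frame KL divergence
```

Minimizing this quantity with respect to the encoder producing `feats_a` and `feats_b` pushes each frame's similarity profile toward its DTW correspondence distribution, which is the intuition behind matching alignments to feature similarities.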