In this paper, we tackle the problem of video alignment: matching the frames of a pair of videos containing similar actions. The main challenge in video alignment is that accurate correspondences must be established despite differences in execution and appearance between the two videos. We introduce an unsupervised alignment method that uses both global and local features of the frames. In particular, we construct effective features for each video frame by means of three machine vision tools: person detection, pose estimation, and the VGG network. These features are then processed and combined into a multidimensional time series that represents the video. The resulting time series are used to align videos of the same action using a novel variant of dynamic time warping called Diagonalized Dynamic Time Warping (DDTW). The main advantage of our approach is that no training is required, which makes it applicable to any new type of action without the need to collect training samples. For evaluation, we consider video synchronization and phase classification tasks on the Penn Action dataset. In addition, for an effective evaluation of the video synchronization task, we present a new metric called Enclosed Area Error (EAE). The results show that our method outperforms previous state-of-the-art methods, such as TCC and other self-supervised and supervised approaches.
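As background for the alignment step, the following is a minimal sketch of classic dynamic time warping between two frame-feature sequences; the abstract does not specify DDTW's diagonal constraint, so this shows only the standard algorithm that DDTW builds on. The function name `dtw` and the Euclidean frame distance are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def dtw(x, y):
    """Align two multidimensional time series x (n x d) and y (m x d)
    with classic dynamic time warping. The paper's DDTW variant would
    additionally bias the warping path toward the diagonal (details
    not given in the abstract)."""
    n, m = len(x), len(y)
    # Pairwise Euclidean distances between frame feature vectors.
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    # Accumulated-cost matrix with a padded border of infinities.
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(
                D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # Backtrack to recover the frame-to-frame alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]
```

Aligning a series with itself yields zero cost and a purely diagonal path; for two executions of the same action at different speeds, the path stretches or compresses accordingly.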