We propose TAIN (Transformers and Attention for video INterpolation), a residual neural network for video interpolation that predicts an intermediate frame given the two consecutive frames surrounding it. We first present a novel vision transformer module, named Cross Similarity (CS), which globally aggregates input image features whose appearance is similar to that of the predicted interpolated frame. These CS features are then used to refine the interpolated prediction. To account for occlusions in the CS features, we propose an Image Attention (IA) module that lets the network favor CS features from one frame over those of the other. TAIN outperforms existing methods that do not require flow estimation and performs comparably to flow-based methods, while being efficient at inference time, on the Vimeo90k, UCF101, and SNU-FILM benchmarks.
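The two modules described above can be pictured as (1) a cross-attention step, in which features of the predicted frame query the features of an input frame for similar appearance, and (2) a per-location gate that weighs the resulting features from the two input frames. The sketch below is an illustrative numpy toy, not the authors' implementation; the function names, shapes, and the `occlusion_scores` input are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_similarity(query_feats, frame_feats):
    """Hypothetical sketch of a CS-style aggregation: each feature of the
    predicted frame (query) attends over ALL spatial features of one input
    frame and aggregates those with similar appearance.
    query_feats: (Nq, d), frame_feats: (Nk, d)."""
    d = query_feats.shape[-1]
    scores = query_feats @ frame_feats.T / np.sqrt(d)   # (Nq, Nk) similarities
    weights = softmax(scores, axis=-1)                  # attention over all locations
    return weights @ frame_feats                        # globally aggregated features

def image_attention(cs0, cs1, occlusion_scores):
    """Hypothetical sketch of an IA-style gate: per-location weights pick
    between the CS features of frame 0 and frame 1, e.g. when a region is
    occluded in one of the input frames.
    cs0, cs1: (Nq, d); occlusion_scores: (Nq, 2)."""
    w = softmax(occlusion_scores, axis=-1)              # (Nq, 2), rows sum to 1
    return w[:, :1] * cs0 + w[:, 1:] * cs1              # convex blend per location
```

With equal occlusion scores the gate reduces to an even blend of the two frames' CS features; skewed scores shift the output toward the frame that is visible at that location.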