Aligning adjacent frames is considered an essential operation in video super-resolution (VSR). Advanced VSR models, including the latest VSR Transformers, are generally equipped with well-designed alignment modules. However, advances in the self-attention mechanism may challenge this conventional wisdom. In this paper, we rethink the role of alignment in VSR Transformers and make several counter-intuitive observations. Our experiments show that (i) VSR Transformers can directly utilize multi-frame information from unaligned videos, and (ii) existing alignment methods are sometimes harmful to VSR Transformers. These observations indicate that the performance of VSR Transformers can be further improved simply by removing the alignment module and adopting a larger attention window. Nevertheless, such a design dramatically increases the computational burden and cannot handle large motions. Therefore, we propose a new and efficient alignment method called patch alignment, which aligns image patches instead of pixels. VSR Transformers equipped with patch alignment achieve state-of-the-art performance on multiple benchmarks. Our work provides valuable insights into how multi-frame information is used in VSR and how to select alignment methods for different networks and datasets. Code and models will be released at https://github.com/XPixelGroup/RethinkVSRAlignment.
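The abstract states only that patch alignment aligns image patches instead of pixels. As a rough illustration of that idea, and not the released implementation, the sketch below assumes a precomputed optical-flow field, reduces it to one integer motion vector per patch (the rounded mean flow), and copies each patch whole from the supporting frame. The function name `patch_align`, the `patch` parameter, and the `(dx, dy)` flow convention are all hypothetical choices made for this example.

```python
import numpy as np


def patch_align(neighbor, flow, patch=8):
    """Align `neighbor` to the reference frame one whole patch at a time.

    Illustrative sketch only; the conventions below are assumptions.
    neighbor: (H, W, C) array, the supporting frame.
    flow:     (H, W, 2) array; flow[y, x] = (dx, dy) points from a
              reference-frame pixel to its match in `neighbor`.
    patch:    patch side length; H and W must be multiples of it.
    """
    h, w = neighbor.shape[:2]
    if h % patch or w % patch:
        raise ValueError("frame size must be a multiple of the patch size")

    aligned = np.empty_like(neighbor)
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            # One motion vector per patch: mean flow rounded to whole pixels,
            # so the patch is moved as a unit and never resampled.
            mean_flow = flow[y:y + patch, x:x + patch].mean(axis=(0, 1))
            dx, dy = np.rint(mean_flow).astype(int)
            # Clamp so the source patch stays inside the frame.
            sy = int(np.clip(y + dy, 0, h - patch))
            sx = int(np.clip(x + dx, 0, w - patch))
            aligned[y:y + patch, x:x + patch] = \
                neighbor[sy:sy + patch, sx:sx + patch]
    return aligned
```

Because every patch is copied without interpolation, the pixels inside it keep their original relative arrangement, which is the property that distinguishes patch-level from pixel-level warping in this sketch. Patches whose displaced source would fall outside the frame are clamped to the border here; that boundary rule is also an assumption.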