Video super-resolution (VSR) aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts. Although some progress has been made, it remains a grand challenge to effectively utilize temporal dependencies across entire video sequences. Existing approaches usually align and aggregate information from a limited number of adjacent frames (e.g., 5 or 7 frames), which prevents them from achieving satisfactory results. In this paper, we take one step further to enable effective spatio-temporal learning in videos. We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR). In particular, we formulate video frames into several pre-aligned trajectories that consist of continuous visual tokens. For a query token, self-attention is learned only on relevant visual tokens along spatio-temporal trajectories. Compared with vanilla vision Transformers, this design significantly reduces the computational cost and enables Transformers to model long-range features. We further propose a cross-scale feature tokenization module to overcome the scale-changing problems that often occur in long videos. Experimental results demonstrate the superiority of the proposed TTVSR over state-of-the-art models through extensive quantitative and qualitative evaluations on four widely-used video super-resolution benchmarks. Both code and pre-trained models can be downloaded at https://github.com/researchmm/TTVSR.
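To make the trajectory-restricted attention idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: each query token attends only to the tokens gathered along its pre-computed spatio-temporal trajectory, rather than to every token in every frame. The function name `trajectory_attention`, the index tensor `traj_idx`, and the tensor layout are illustrative assumptions; in the paper, trajectories would be produced by motion estimation and keys/values come from tokenized frame features.

```python
# A minimal sketch of trajectory-restricted attention (assumed interface,
# not the official TTVSR code). Keys and values share features here for brevity.
import torch
import torch.nn.functional as F

def trajectory_attention(q, kv, traj_idx):
    """
    q:        (N, C)     query tokens of the current frame (N = H*W locations)
    kv:       (T, N, C)  tokens from T past frames
    traj_idx: (T, N)     for each query location n, the spatial index of the
                         token it maps to in past frame t along its trajectory
    returns:  (N, C)     aggregated features for the current frame
    """
    T, N, C = kv.shape
    # Gather, for every query location, the single token per past frame that
    # lies on its trajectory: k[t, n] = kv[t, traj_idx[t, n]].
    idx = traj_idx.unsqueeze(-1).expand(T, N, C)      # (T, N, C)
    k = kv.gather(1, idx).transpose(0, 1)             # (N, T, C)

    # Scaled dot-product attention over only the T trajectory tokens per query,
    # instead of all T*N tokens as in vanilla full attention.
    attn = torch.einsum("nc,ntc->nt", q, k) / C ** 0.5   # (N, T)
    attn = F.softmax(attn, dim=-1)
    return torch.einsum("nt,ntc->nc", attn, k)           # (N, C)

# Toy usage: a 16x16 token grid, 8 past frames, 64-dim tokens.
H = W = 16; T = 8; C = 64; N = H * W
q = torch.randn(N, C)
kv = torch.randn(T, N, C)
traj_idx = torch.randint(0, N, (T, N))  # real trajectories come from motion estimation
out = trajectory_attention(q, kv, traj_idx)
print(out.shape)  # torch.Size([256, 64])
```

Under these assumptions, the per-query key set shrinks from T*N tokens to T tokens, which is the source of the computational saving the abstract claims over vanilla vision Transformers.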