Although convolutional neural networks (CNNs) have recently demonstrated high-quality reconstruction for video super-resolution (VSR), efficiently training competitive VSR models remains a challenging problem. It usually takes an order of magnitude more time than training their counterpart image models, leading to long research cycles. Existing VSR methods typically train models with fixed spatial and temporal sizes from beginning to end. The fixed sizes are usually set to large values for good performance, resulting in slow training. However, is such a rigid training strategy necessary for VSR? In this work, we show that it is possible to gradually train video models from small to large spatial/temporal sizes, i.e., in an easy-to-hard manner. In particular, the whole training is divided into several stages, and earlier stages use smaller training spatial shapes. Inside each stage, the temporal size also varies from short to long while the spatial size remains unchanged. Such a multigrid training strategy accelerates training, as most of the computation is performed on smaller spatial and shorter temporal shapes. For further acceleration with GPU parallelization, we also investigate large minibatch training without loss in accuracy. Extensive experiments demonstrate that our method is capable of significantly speeding up training (up to $6.2\times$ speedup in wall-clock training time) without performance drop for various VSR models. The code is available at https://github.com/TencentARC/Efficient-VSR-Training.
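The staged, easy-to-hard schedule described above can be sketched as a simple generator that emits the (spatial, temporal) training shape for each iteration. This is a minimal illustrative sketch: the stage boundaries, patch sizes, clip lengths, and iteration counts below are assumptions for demonstration, not the paper's exact configuration.

```python
def multigrid_schedule(stages):
    """Yield a (spatial_size, temporal_size) pair per training iteration.

    Each stage fixes one spatial size; within a stage, the temporal
    length grows from short to long while the spatial size is held fixed,
    mirroring the easy-to-hard multigrid strategy.
    """
    for spatial, temporal_sizes, iters_per_phase in stages:
        for temporal in temporal_sizes:
            for _ in range(iters_per_phase):
                yield spatial, temporal

# Hypothetical example: three stages with increasing spatial patch size;
# inside each stage, the clip length ramps from 7 to 15 frames.
stages = [
    (32, (7, 11, 15), 1000),   # early stage: small patches, cheap iterations
    (48, (7, 11, 15), 1000),
    (64, (7, 11, 15), 1000),   # final stage: full training shapes
]

schedule = list(multigrid_schedule(stages))
print(schedule[0])    # smallest shapes first: (32, 7)
print(schedule[-1])   # full shapes at the end: (64, 15)
```

Because most iterations run on small spatial patches and short clips, the per-iteration cost early in training is a fraction of the cost at the final, full-size stage, which is the source of the wall-clock speedup.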