Convolutional neural networks (CNNs) can model complicated non-linear relations between images. However, they are notoriously sensitive to small changes in the input. Most CNNs trained to describe image-to-image mappings generate temporally unstable results when applied to video sequences, leading to flickering artifacts and other inconsistencies over time. In order to use CNNs for video material, previous methods have relied on estimating dense frame-to-frame motion information (optical flow) in the training and/or the inference phase, or on recurrent learning structures. We take a different approach to the problem, posing temporal stability as a regularization of the cost function. The regularization is formulated to account for different types of motion that can occur between frames, so that temporally stable CNNs can be trained without the need for video material or expensive motion estimation. The training can be performed as a fine-tuning operation, without architectural modifications of the CNN. Our evaluation shows that the training strategy leads to large improvements in temporal smoothness. Moreover, in situations where the quantity of training data is limited, the regularization can help in boosting the generalization performance to a much larger extent than what is possible with naïve augmentation strategies.
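The core idea, regularizing for temporal stability by simulating inter-frame motion on single images, can be illustrated with a minimal sketch. All names here are hypothetical: a stand-in "model" (a 3×3 box blur) replaces a trained CNN, and a circular pixel shift stands in for a real motion warp. The regularization term penalizes the mismatch between the output of a warped input and the warped output, so a model that commutes with small motion incurs zero penalty:

```python
import numpy as np

def shift(img, dx, dy):
    # Simulate small inter-frame motion with a circular pixel shift
    # (an illustrative stand-in for a general motion warp).
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

def model(x):
    # Hypothetical image-to-image CNN stand-in: a 3x3 box blur built
    # from shifts. This operator is shift-equivariant by construction.
    out = np.zeros_like(x)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += shift(x, dx, dy)
    return out / 9.0

def temporal_regularization(x, dx=1, dy=0, lam=0.1):
    # Penalize the difference between processing a "next frame"
    # (warped input) and warping the current output: a single-image
    # proxy for temporal stability, added to the task loss in training.
    y = model(x)
    y_next = model(shift(x, dx, dy))
    return lam * np.mean((y_next - shift(y, dx, dy)) ** 2)

rng = np.random.default_rng(0)
x = rng.random((16, 16))
reg = temporal_regularization(x)  # near zero: the blur commutes with shifts
```

Because the box blur commutes with circular shifts, the penalty vanishes here; for a real CNN it would be nonzero and would push the network toward motion-consistent outputs during fine-tuning.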