We show how transformers can be used to vastly simplify neural video compression. Previous methods have relied on an increasing number of architectural biases and priors, including motion prediction and warping operations, resulting in complex models. Instead, we independently map input frames to representations and use a transformer to model their dependencies, letting it predict the distribution of future representations given the past. The resulting video compression transformer outperforms previous methods on standard video compression datasets. Experiments on synthetic data show that our model learns to handle complex motion patterns such as panning, blurring, and fading purely from data. Our approach is easy to implement, and we release code to facilitate future research.
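To make the described pipeline concrete, here is a minimal, hypothetical sketch (not the authors' released implementation): each frame is encoded independently into a grid of latent tokens, and a transformer predicts the mean and scale of the current frame's tokens from the previous frame's tokens, which would parameterize the entropy model used for coding. All module names, shapes, and hyperparameters below are illustrative assumptions.

```python
# Illustrative sketch only; names and hyperparameters are assumptions, not the paper's code.
import torch
import torch.nn as nn


class FrameEncoder(nn.Module):
    """Maps each frame independently to a sequence of latent tokens."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=4, stride=4),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=4, stride=4),
        )

    def forward(self, frame):                      # frame: (B, 3, H, W)
        z = self.conv(frame)                       # (B, C, H/16, W/16)
        return z.flatten(2).transpose(1, 2)        # (B, T, C) token sequence


class TemporalEntropyModel(nn.Module):
    """Transformer predicting mean/scale of current tokens from past tokens."""
    def __init__(self, channels=64, layers=4, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(channels, 2 * channels)  # mean and log-scale

    def forward(self, past_tokens):                # (B, T, C) tokens of previous frame
        h = self.transformer(past_tokens)
        mean, log_scale = self.head(h).chunk(2, dim=-1)
        return mean, log_scale.exp()


def bits_estimate(tokens, mean, scale):
    """Approximate rate: negative log-likelihood (in bits) under the prediction."""
    dist = torch.distributions.Normal(mean, scale)
    return -dist.log_prob(tokens).sum() / torch.log(torch.tensor(2.0))


# Toy usage: the previous frame's tokens predict the distribution of the next frame's tokens.
encoder = FrameEncoder()
entropy_model = TemporalEntropyModel()
frames = torch.rand(1, 2, 3, 64, 64)               # (B, num_frames, 3, H, W)
prev_tokens = encoder(frames[:, 0])
curr_tokens = encoder(frames[:, 1])
mean, scale = entropy_model(prev_tokens)
rate = bits_estimate(curr_tokens, mean, scale)      # lower when motion is well predicted
```

Under this reading, a better temporal prediction tightens the predicted distribution and lowers the estimated bit rate, which is what lets the transformer replace explicit motion prediction and warping.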