Learning-based video compression has attracted increasing attention in recent years. Previous hybrid coding approaches rely on pixel-space operations to reduce spatial and temporal redundancy, which may suffer from inaccurate motion estimation or less effective motion compensation. In this work, we propose a feature-space video coding network (FVC) by performing all major operations (i.e., motion estimation, motion compression, motion compensation and residual compression) in the feature space. Specifically, in the proposed deformable compensation module, we first apply motion estimation in the feature space to produce motion information (i.e., the offset maps), which is then compressed by an auto-encoder style network. We then perform motion compensation by using deformable convolution to generate the predicted feature. After that, we compress the residual feature between the feature from the current frame and the predicted feature from our deformable compensation module. For better frame reconstruction, the reference features from multiple previously reconstructed frames are also fused by using the non-local attention mechanism in the multi-frame feature fusion module. Comprehensive experimental results demonstrate that the proposed framework achieves state-of-the-art performance on four benchmark datasets, including HEVC, UVG, VTL and MCL-JCV.
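The feature-space compensation idea above can be illustrated with a minimal numpy sketch. This is not the paper's actual deformable-convolution module: it replaces deformable convolution with a simplified nearest-neighbor warp driven by a per-position offset map, and all function and variable names here are illustrative assumptions. It only shows the data flow: warp the reference feature with offsets to get a predicted feature, then code the residual between the current and predicted features.

```python
import numpy as np

def deformable_compensate(ref_feat, offsets):
    """Warp reference features using per-position offsets.

    ref_feat: (C, H, W) reference feature map from the previous frame.
    offsets:  (2, H, W) per-position (dy, dx) displacements.
    A nearest-neighbor gather stands in for true deformable convolution.
    """
    C, H, W = ref_feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    sy = np.clip(np.round(ys + offsets[0]).astype(int), 0, H - 1)
    sx = np.clip(np.round(xs + offsets[1]).astype(int), 0, W - 1)
    # Gather: pred[c, y, x] = ref_feat[c, y + dy, x + dx]
    return ref_feat[:, sy, sx]

# Toy features: one channel, 3x4 spatial grid.
ref = np.arange(12, dtype=float).reshape(1, 3, 4)

# Offset map encoding a uniform shift of +1 along x.
off = np.zeros((2, 3, 4))
off[1] = 1.0

pred = deformable_compensate(ref, off)

# Pretend the current frame's feature differs from the prediction
# by a small residual; coding pred + residual recovers it exactly
# (real codecs compress the residual lossily).
cur = pred + 0.5
resid = cur - pred
recon = pred + resid
```

In the full framework, the offset map and the residual feature would each pass through an auto-encoder style compression network before reconstruction; here they are kept uncompressed so the round trip is exact.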