We propose sandwiched video compression, a video compression system that wraps neural networks around a standard video codec. The sandwich framework consists of a neural pre-processor and a neural post-processor with a standard video codec between them. The networks are trained jointly to optimize a rate-distortion loss function, with the goal of significantly improving over the standard codec in various compression scenarios. End-to-end training in this setting requires a differentiable proxy for the standard video codec, one that incorporates temporal processing with motion compensation, inter/intra mode decisions, and in-loop filtering. We propose differentiable approximations to key video codec components and demonstrate, in two important scenarios, that the neural codes of the sandwich yield significantly better rate-distortion performance than compressing the original frames of the input video. When transporting high-resolution video via low-resolution HEVC, the sandwich system obtains a 6.5 dB improvement over standard HEVC. More importantly, using the well-known perceptual similarity metric LPIPS, we observe a $\sim 30\%$ improvement in rate at the same quality over HEVC. Finally, we show that pre- and post-processors formed by modestly parameterized, lightweight networks can closely approximate these results.
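The jointly optimized rate-distortion objective mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' method; everything in it is an assumption made for illustration. A scalar `gain` stands in for the neural pre- and post-processors, integer rounding stands in for the standard codec, the empirical entropy of the rounded symbols stands in for the bitrate, and the loss is assumed to take the common Lagrangian form $J = D + \lambda R$ with a hypothetical weight `lam`. The sketch only evaluates the loss; it does not include the differentiable codec proxy that gradient-based training would need.

```python
import math
from collections import Counter


def pre_process(frame, gain):
    """Hypothetical stand-in for the neural pre-processor: a scalar gain."""
    return [gain * x for x in frame]


def codec_proxy(codes):
    """Toy stand-in for the standard codec: round each sample to an integer."""
    return [round(c) for c in codes]


def post_process(decoded, gain):
    """Hypothetical stand-in for the neural post-processor: undo the gain."""
    return [d / gain for d in decoded]


def rate_bits_per_sample(symbols):
    """Empirical entropy of the quantized symbols, used as a rate estimate."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def rd_loss(frame, gain, lam):
    """Return (J, D, R) for the assumed Lagrangian J = D + lam * R."""
    symbols = codec_proxy(pre_process(frame, gain))
    recon = post_process(symbols, gain)
    distortion = mse(frame, recon)
    rate = rate_bits_per_sample(symbols)
    return distortion + lam * rate, distortion, rate


# Synthetic 1-D "frame"; a real system would operate on video frames.
frame = [0.1 * i for i in range(64)]
for gain in (0.5, 1.0, 4.0):
    loss, d, r = rd_loss(frame, gain, lam=0.05)
    print(f"gain={gain}: D={d:.4f}  R={r:.3f} bits/sample  J={loss:.4f}")
```

In this toy setting a larger `gain` lowers distortion at the cost of a higher rate estimate, which is the trade-off the weight `lam` arbitrates. In the sandwich system, the pre- and post-processor parameters would play the role of `gain` and would be learned by gradient descent through a differentiable codec proxy.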