We address end-to-end learned video compression with a special focus on better learning and utilizing temporal contexts. For temporal context mining, we propose to store not only the previously reconstructed frames but also the propagated features in the generalized decoded picture buffer. From the stored propagated features, we propose to learn multi-scale temporal contexts and re-fill the learned temporal contexts into the modules of our compression scheme, including the contextual encoder-decoder, the frame generator, and the temporal context encoder. Our scheme discards the parallelization-unfriendly auto-regressive entropy model to pursue a more practical decoding time. We compare our scheme with x264 and x265 (representing industrial software for H.264 and H.265, respectively) as well as the official reference software for H.264, H.265, and H.266 (JM, HM, and VTM, respectively). When the intra period is 32, our scheme outperforms H.265--HM with a 14.4% bit rate saving when oriented to PSNR, and outperforms H.266--VTM with a 21.1% bit rate saving when oriented to MS-SSIM.
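To make the two ideas named above more concrete, the following is a minimal sketch (not the authors' implementation): a generalized decoded picture buffer that stores the propagated feature alongside the reconstructed frame, and a temporal context mining module that derives multi-scale contexts from that stored feature. Class names, channel widths, and the use of plain strided convolutions are illustrative assumptions.

```python
# Minimal sketch of a generalized decoded picture buffer and multi-scale
# temporal context mining; module structure and hyperparameters are assumptions.

import torch
import torch.nn as nn


class GeneralizedDPB:
    """Decoded picture buffer holding the last reconstructed frame and its feature."""

    def __init__(self):
        self.ref_frame = None    # previously reconstructed frame, shape (N, 3, H, W)
        self.ref_feature = None  # propagated feature, shape (N, C, H, W)

    def update(self, frame: torch.Tensor, feature: torch.Tensor) -> None:
        # Store references for coding the next frame.
        self.ref_frame = frame.detach()
        self.ref_feature = feature.detach()


class TemporalContextMining(nn.Module):
    """Learn multi-scale temporal contexts from the propagated reference feature."""

    def __init__(self, channels: int = 64):
        super().__init__()
        # One branch per scale; each strided convolution halves the resolution.
        self.scale0 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.scale1 = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)
        self.scale2 = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)

    def forward(self, ref_feature: torch.Tensor):
        # Contexts at full, 1/2, and 1/4 resolution; in the scheme described above
        # they are re-filled into the contextual encoder-decoder, the frame
        # generator, and the temporal context encoder.
        ctx0 = self.scale0(ref_feature)
        ctx1 = self.scale1(ctx0)
        ctx2 = self.scale2(ctx1)
        return ctx0, ctx1, ctx2


if __name__ == "__main__":
    dpb = GeneralizedDPB()
    dpb.update(frame=torch.rand(1, 3, 256, 256), feature=torch.rand(1, 64, 256, 256))
    contexts = TemporalContextMining(channels=64)(dpb.ref_feature)
    print([tuple(c.shape) for c in contexts])  # full, 1/2, and 1/4 resolution contexts
```

In this sketch the buffer stores features rather than only pixels, so temporal information can propagate across frames without being re-derived from the reconstructed images at every step; the actual scheme additionally conditions the contexts on decoded motion, which is omitted here for brevity.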