Video compression has long been an active research area, and many traditional and deep video compression methods have been proposed. These methods typically rely on signal prediction theory to enhance compression performance, designing highly efficient intra and inter prediction strategies and compressing video frames one by one. In this paper, we propose a novel model-based video compression (MVC) framework that regards scenes as the fundamental units of video sequences. Our proposed MVC directly models the intensity variation of the entire video sequence within a scene, seeking non-redundant representations instead of reducing redundancy through spatio-temporal prediction. To achieve this, we employ implicit neural representation (INR) as our basic modeling architecture. To improve the efficiency of video modeling, we first propose context-related spatial positional embedding (CRSPE) and frequency domain supervision (FDS) for spatial context enhancement. To capture temporal correlation, we design a scene flow constraint mechanism (SFCM) and a temporal contrastive loss (TCL). Extensive experimental results demonstrate that our method achieves up to a 20\% bitrate reduction compared to the latest video coding standard H.266 and is more efficient in decoding than existing video coding strategies.
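To make the INR modeling idea concrete, the following is a minimal sketch (not the paper's actual MVC architecture: CRSPE, FDS, SFCM, and TCL are not reproduced here). It fits a coordinate-based MLP that maps normalized spatio-temporal coordinates $(x, y, t)$ of one scene to RGB intensity; all layer sizes and the sinusoidal embedding are illustrative assumptions.

\begin{verbatim}
import torch
import torch.nn as nn

class VideoINR(nn.Module):
    """Coordinate-based MLP: maps (x, y, t) in [-1, 1] to RGB intensity."""
    def __init__(self, num_freqs: int = 10, hidden: int = 256):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 3 * 2 * num_freqs  # sin/cos per frequency per coordinate
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # RGB output
        )

    def embed(self, coords: torch.Tensor) -> torch.Tensor:
        # Standard sinusoidal positional embedding of each coordinate.
        freqs = 2.0 ** torch.arange(self.num_freqs, device=coords.device)
        angles = coords[..., None] * freqs * torch.pi          # (..., 3, F)
        emb = torch.cat([angles.sin(), angles.cos()], dim=-1)  # (..., 3, 2F)
        return emb.flatten(-2)                                 # (..., 6F)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.mlp(self.embed(coords)))

# Overfit the network to one scene: the compressed bitstream is then
# (a quantized form of) the network weights rather than residual frames.
model = VideoINR()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
coords = torch.rand(4096, 3) * 2 - 1   # random (x, y, t) samples
target = torch.rand(4096, 3)           # placeholder ground-truth RGB
loss = nn.functional.mse_loss(model(coords), target)
loss.backward()
opt.step()
\end{verbatim}

Decoding under this formulation is a forward pass at the desired coordinates, which is why INR-based schemes can decode faster than prediction-based codecs that reconstruct frames sequentially.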