As video content and displays move to ever higher resolutions, the sheer volume of video data poses significant challenges to acquiring, transmitting, compressing, and displaying high-quality video. In this paper, we propose a new deep learning video compression architecture that does not require motion estimation, which is the most expensive element of modern hybrid video compression codecs such as H.264 and HEVC. Our framework exploits the regularities inherent to video motion, which we capture by using displaced frame differences as the video representations on which the neural network is trained. In addition, we propose a new space-time reconstruction network that combines an LSTM model with a UNet model, which we call LSTM-UNet. The combined network efficiently captures both temporal and spatial video information, making it well suited to our purposes. The new video compression framework has three components: a Displacement Calculation Unit (DCU), a Displacement Compression Network (DCN), and a Frame Reconstruction Network (FRN), all of which are jointly optimized against a single perceptual loss function. The DCU obviates the need for the motion estimation used in hybrid codecs, and is far less expensive. In the DCN, an RNN-based network conducts variable bit-rate encoding after a single round of training. The LSTM-UNet in the FRN learns space-time differential representations of videos. Our experimental results show that our compression model, which we call the MOtionless VIdeo Codec (MOVI-Codec), learns how to efficiently compress videos without computing motion. Our experiments show that MOVI-Codec outperforms the video coding standard H.264 and is highly competitive with, and sometimes exceeds, the performance of the modern global standard HEVC codec, as measured by MS-SSIM.
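To make the displaced-frame-difference representation concrete, below is a minimal PyTorch sketch. The displacement set and the name `displaced_frame_differences` are illustrative assumptions, not the paper's exact configuration; the abstract only states that shifted differences between frames are used in place of explicit motion estimation.

```python
import torch

# Hypothetical set of small (dy, dx) displacements; the paper's exact set is
# not specified in the abstract.
DISPLACEMENTS = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]

def displaced_frame_differences(cur: torch.Tensor, prev: torch.Tensor) -> torch.Tensor:
    """Stack differences between `cur` and spatially shifted copies of `prev`.

    cur, prev: (N, C, H, W) frames.
    Returns:   (N, C * len(DISPLACEMENTS), H, W) difference volume.
    """
    diffs = []
    for dy, dx in DISPLACEMENTS:
        # torch.roll wraps at the borders; a real codec would likely pad
        # instead, but rolling keeps this sketch short and self-contained.
        shifted = torch.roll(prev, shifts=(dy, dx), dims=(2, 3))
        diffs.append(cur - shifted)
    return torch.cat(diffs, dim=1)
```

Because each channel group corresponds to one fixed displacement, the network can learn which displacement best explains local motion without ever running a block-matching or optical-flow search, which is the cost the DCU is designed to avoid.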
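One plausible way to combine the LSTM and UNet components named in the abstract is to insert convolutional LSTM cells into a UNet so that temporal state is carried across frames while the encoder/decoder handles spatial structure. The `ConvLSTMCell` below is a standard building block written as a hypothetical sketch; the paper's actual LSTM-UNet layer placement is not specified in the abstract.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell: an LSTM whose gates are 2D convolutions,
    so hidden and cell states are feature maps rather than vectors."""

    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        self.hid_ch = hid_ch
        # A single conv produces all four gate pre-activations at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state=None):
        if state is None:
            n, _, h_, w_ = x.shape
            zeros = x.new_zeros(n, self.hid_ch, h_, w_)
            state = (zeros, zeros)
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)  # update cell state
        h = o * torch.tanh(c)          # emit new hidden feature map
        return h, (h, c)
```

Running one such cell per UNet scale, feeding it that scale's features frame by frame, is one way a reconstruction network could accumulate the space-time differential information the FRN is described as learning.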