Video-to-Video synthesis (Vid2Vid) has achieved remarkable results in generating photo-realistic videos from sequences of semantic maps. However, this pipeline suffers from high computational cost and long inference latency, which largely depend on two essential factors: 1) the network architecture parameters, and 2) the sequential data stream. Recently, the parameters of image-based generative models have been significantly compressed via more efficient network architectures. Nevertheless, existing methods mainly focus on slimming network architectures and ignore the size of the sequential data stream. Moreover, due to the lack of temporal coherence, image-based compression is not sufficient for video tasks. In this paper, we present a spatial-temporal compression framework, \textbf{Fast-Vid2Vid}, which focuses on the data aspects of generative models. It makes the first attempt along the time dimension to reduce computational resources and accelerate inference. Specifically, we compress the input data stream spatially and reduce its temporal redundancy. After the proposed spatial-temporal knowledge distillation, our model can synthesize key-frames from the low-resolution data stream. Finally, Fast-Vid2Vid interpolates the intermediate frames by motion compensation with only slight latency. On standard benchmarks, Fast-Vid2Vid achieves real-time performance of around 20 FPS and reduces the computational cost by around 8x on a single V100 GPU.
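To make the described pipeline concrete, the following is a minimal sketch of the inference flow the abstract outlines: spatially downsample the semantic-map stream, synthesize only key-frames with the (distilled) generator, and fill in the skipped frames. The function name, the downsampling factor, the key-frame stride, and the simple linear blending used here in place of flow-based motion compensation are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a Fast-Vid2Vid-style inference loop (not the official code).
import torch
import torch.nn.functional as F


def fast_vid2vid_inference(semantic_maps, generator, scale=0.5, key_stride=2):
    """semantic_maps: (T, C, H, W) tensor of semantic label maps."""
    # 1) Spatial compression: feed the generator a low-resolution data stream.
    low_res = F.interpolate(semantic_maps, scale_factor=scale, mode="nearest")

    # 2) Temporal compression: synthesize only every `key_stride`-th frame.
    key_idx = list(range(0, low_res.shape[0], key_stride))
    with torch.no_grad():
        key_frames = generator(low_res[key_idx])  # (K, 3, h, w)

    # 3) Interpolate the skipped intermediate frames between key-frames.
    #    Linear blending is a stand-in here for the motion-compensated
    #    interpolation described in the abstract.
    frames = []
    for k in range(len(key_idx) - 1):
        a, b = key_frames[k], key_frames[k + 1]
        frames.append(a)
        for s in range(1, key_stride):
            t = s / key_stride
            frames.append((1 - t) * a + t * b)
    frames.append(key_frames[-1])
    return torch.stack(frames)


if __name__ == "__main__":
    # Dummy generator (identity) just to exercise the control flow.
    T, C, H, W = 9, 3, 256, 512
    maps = torch.rand(T, C, H, W)
    video = fast_vid2vid_inference(maps, lambda x: x, scale=0.5, key_stride=2)
    print(video.shape)  # torch.Size([9, 3, 128, 256])
```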