In the deep learning era, generating long, high-quality videos remains challenging due to the spatio-temporal complexity and continuity of videos. Prior works have attempted to model the video distribution by representing videos as 3D grids of RGB values, which impedes the scale of generated videos and neglects continuous dynamics. In this paper, we find that the recently emerging paradigm of implicit neural representations (INRs), which encodes a continuous signal as a parameterized neural network, effectively mitigates this issue. Leveraging INRs of video, we propose the dynamics-aware implicit generative adversarial network (DIGAN), a novel generative adversarial network for video generation. Specifically, we introduce (a) an INR-based video generator that improves motion dynamics by manipulating the space and time coordinates differently and (b) a motion discriminator that efficiently identifies unnatural motions without observing the entire long frame sequence. We demonstrate the superiority of DIGAN on various datasets, along with multiple intriguing properties, e.g., long video synthesis, video extrapolation, and non-autoregressive video generation. For example, DIGAN improves the previous state-of-the-art FVD score on UCF-101 by 30.7% and can be trained on 128-frame videos of 128x128 resolution, 80 frames longer than the 48 frames of the previous state-of-the-art method.
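To make the two components concrete, below is a minimal PyTorch sketch (not the authors' code) of the idea: the generator is a coordinate MLP that maps (x, y, t) plus a latent code to an RGB value, so a video is a continuous function that can be sampled on any spatio-temporal grid; the discriminator judges motion from only a pair of frames and their time gap. The class names, layer sizes, and the `spatial_scale`/`temporal_scale` Fourier-feature frequencies are illustrative assumptions; the lower temporal frequency stands in, at a high level, for "manipulating the space and time coordinates differently."

```python
# Minimal sketch of an INR-based video GAN, under the assumptions stated above.
import torch
import torch.nn as nn

class INRVideoGenerator(nn.Module):
    """Coordinate MLP: (x, y, t) + latent code -> RGB (hypothetical design)."""
    def __init__(self, latent_dim=64, hidden=256, spatial_scale=10.0, temporal_scale=1.0):
        super().__init__()
        # Random Fourier features; time gets lower frequencies than space
        # (assumption: a simple proxy for treating time differently to
        # encourage smooth motion).
        self.register_buffer("B_xy", torch.randn(2, hidden // 2) * spatial_scale)
        self.register_buffer("B_t", torch.randn(1, hidden // 2) * temporal_scale)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Tanh(),  # RGB in [-1, 1]
        )

    def forward(self, z, coords_xy, coords_t):
        # coords_xy: (N, 2) in [0, 1]^2; coords_t: (N, 1) in [0, 1]; z: (latent_dim,)
        feat_xy = torch.cat([torch.sin(coords_xy @ self.B_xy),
                             torch.cos(coords_xy @ self.B_xy)], dim=-1)
        feat_t = torch.cat([torch.sin(coords_t @ self.B_t),
                            torch.cos(coords_t @ self.B_t)], dim=-1)
        z = z.expand(coords_xy.shape[0], -1)  # share one latent across coordinates
        return self.mlp(torch.cat([feat_xy, feat_t, z], dim=-1))

class MotionDiscriminator(nn.Module):
    """Scores realism of motion from two frames and their time gap
    (assumption: a pair suffices, so no full clip is ever observed)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 * 2 + 1, hidden, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(hidden, hidden * 2, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(hidden * 2, 1),
        )

    def forward(self, frame_a, frame_b, dt):
        # frame_*: (B, 3, H, W); dt: (B,) normalized time difference,
        # broadcast to a constant extra channel.
        dt_map = dt.view(-1, 1, 1, 1).expand(-1, 1, *frame_a.shape[2:])
        return self.net(torch.cat([frame_a, frame_b, dt_map], dim=1))
```

Because the generator is queried per coordinate, sampling it on a denser or longer grid yields higher-resolution or extrapolated frames in one shot, which is how an INR-based design supports the long video synthesis, extrapolation, and non-autoregressive generation properties mentioned above.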