Videos show continuous events, yet most, if not all, video synthesis frameworks treat them as discrete in time. In this work, we think of videos as what they should be: time-continuous signals, and extend the paradigm of neural representations to build a continuous-time video generator. For this, we first design continuous motion representations through the lens of positional embeddings. Then, we explore the question of training on very sparse videos and demonstrate that a good generator can be learned by using as few as 2 frames per clip. After that, we rethink the traditional pair of image and video discriminators and propose to use a single hypernetwork-based one. This decreases the training cost and provides a richer learning signal to the generator, making it possible to train directly on 1024$^2$ videos for the first time. We build our model on top of StyleGAN2, and it is only ${\approx}5\%$ more expensive to train at the same resolution while achieving almost the same image quality. Moreover, our latent space has similar properties, enabling spatial manipulations that our method can propagate in time. We can generate arbitrarily long videos at arbitrarily high frame rates, while prior work struggles to generate even 64 frames at a fixed rate. Our model achieves state-of-the-art results on four modern 256$^2$ video synthesis benchmarks and one 1024$^2$ benchmark. Videos and the source code are available at the project website: https://universome.github.io/stylegan-v.
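As a rough illustration of the continuous-time idea (not the paper's actual implementation), the sketch below shows how a generator conditioned on sinusoidal temporal positional embeddings can be queried at arbitrary timestamps, so frame rate and clip length become free choices at sampling time. The `temporal_embedding` helper, the `ContentToFrame` decoder, and the geometric frequency schedule are all hypothetical stand-ins introduced only for this example.

```python
# Minimal sketch, assuming a sinusoidal temporal embedding and a
# hypothetical frame decoder; this is NOT the paper's architecture.
import torch


def temporal_embedding(t: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Map continuous timestamps t (in seconds) to sinusoidal features."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=torch.float32)  # assumed geometric frequencies
    angles = t[:, None] * freqs[None, :]                         # [T, num_freqs]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)  # [T, 2 * num_freqs]


class ContentToFrame(torch.nn.Module):
    """Hypothetical frame decoder: (content latent, motion embedding) -> image."""

    def __init__(self, z_dim: int = 512, emb_dim: int = 16, res: int = 64):
        super().__init__()
        self.net = torch.nn.Linear(z_dim + emb_dim, 3 * res * res)
        self.res = res

    def forward(self, z: torch.Tensor, emb: torch.Tensor) -> torch.Tensor:
        # Share one content code across all frames; vary only the temporal embedding.
        x = torch.cat([z.expand(emb.shape[0], -1), emb], dim=1)
        return self.net(x).view(-1, 3, self.res, self.res)


# Usage: one content code, frames sampled at any frame rate or duration.
z = torch.randn(1, 512)                      # shared content latent for the clip
timestamps = torch.arange(0, 2.0, 1 / 60.0)  # 2 seconds at 60 fps; any time grid works
frames = ContentToFrame()(z, temporal_embedding(timestamps))
print(frames.shape)  # [120, 3, 64, 64]
```

Because the timestamps are real-valued inputs rather than frame indices, nothing ties the sampler to the frame rate seen during training, which is what enables arbitrarily long, arbitrarily dense sampling.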