This paper introduces the unsupervised learning problem of playable video generation (PVG). In PVG, we aim to allow a user to control the generated video by selecting a discrete action at every time step, as when playing a video game. The difficulty of the task lies both in learning semantically consistent actions and in generating realistic videos conditioned on the user input. We propose a novel framework for PVG that is trained in a self-supervised manner on a large dataset of unlabelled videos. We employ an encoder-decoder architecture where the predicted action labels act as a bottleneck. The network is constrained to learn a rich action space using, as the main driving loss, a reconstruction loss on the generated video. We demonstrate the effectiveness of the proposed approach on several datasets spanning a wide variety of environments. Further details, code and examples are available on our project page willi-menapace.github.io/playable-video-generation-website.
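To make the description above concrete, the following is a minimal, hypothetical sketch of the idea in PyTorch: a discrete action inferred between consecutive frames acts as a bottleneck, and a reconstruction loss on the generated frame is the main driving loss. The module names, layer sizes, number of actions, and the Gumbel-softmax discretization are illustrative assumptions, not the authors' exact architecture.

```python
# Hypothetical sketch of an encoder-decoder with a discrete action bottleneck,
# trained with a reconstruction loss (sizes and layers are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PlayableVideoSketch(nn.Module):
    def __init__(self, frame_channels=3, hidden_dim=128, num_actions=7):
        super().__init__()
        # Frame encoder: 64x64 image -> latent vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(frame_channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden_dim),
        )
        # Action network: infers a discrete action from two consecutive latents.
        self.action_net = nn.Linear(2 * hidden_dim, num_actions)
        # Decoder: previous latent + discrete action -> next 64x64 frame.
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim + num_actions, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, frame_channels, 4, stride=2, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, frame_t, frame_tp1):
        z_t = self.encoder(frame_t)
        z_tp1 = self.encoder(frame_tp1)
        # Discrete action bottleneck; Gumbel-softmax keeps it differentiable.
        logits = self.action_net(torch.cat([z_t, z_tp1], dim=1))
        action = F.gumbel_softmax(logits, tau=1.0, hard=True)
        # Reconstruct the next frame conditioned on the inferred action.
        recon = self.decoder(torch.cat([z_t, action], dim=1))
        return recon, action

# Training step: the reconstruction loss drives learning of the action space.
model = PlayableVideoSketch()
frame_t = torch.rand(4, 3, 64, 64)
frame_tp1 = torch.rand(4, 3, 64, 64)
recon, action = model(frame_t, frame_tp1)
loss = F.mse_loss(recon, frame_tp1)
loss.backward()
```

At test time, the inferred action would be replaced by a user-selected one at each step, which is what makes the generated video "playable".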