We propose a self-supervised framework for learning scene representations from video that are automatically delineated into background, characters, and their animations. Our method capitalizes on the fact that moving characters are equivariant with respect to their transformation across frames, while the background remains constant under that same transformation. After training, we can manipulate image encodings in real time to create unseen combinations of the delineated components. To our knowledge, ours is the first method to perform unsupervised extraction and synthesis of interpretable background, character, and animation. We demonstrate results on three datasets: Moving MNIST with backgrounds, 2D video game sprites, and Fashion Modeling.
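The equivariance/invariance principle above can be illustrated with a minimal training-loss sketch. This is not the authors' implementation; the encoder `enc`, the transformation applier `apply_T`, and the per-frame split into background and character codes are hypothetical names assumed for illustration.

```python
import torch.nn.functional as F

def delineation_losses(enc, apply_T, frame_t, frame_t1, T):
    # Encode each frame into a background code and a character code.
    bg_t, char_t = enc(frame_t)
    bg_t1, char_t1 = enc(frame_t1)

    # Equivariance: transforming the character code of frame_t by the
    # estimated cross-frame transformation T should match frame_{t+1}.
    loss_equiv = F.mse_loss(apply_T(char_t, T), char_t1)

    # Invariance: the background code should be unchanged by T.
    loss_inv = F.mse_loss(bg_t, bg_t1)

    return loss_equiv + loss_inv
```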