Transformers have been successful for many natural language processing tasks. However, applying transformers to the video domain for tasks such as long-term video generation and scene understanding has remained elusive due to their high computational complexity and the lack of a natural tokenization. In this paper, we propose the Object-Centric Video Transformer (OCVT), which uses an object-centric approach to decompose scenes into tokens suitable for a generative video transformer. By factoring the video into objects, our fully unsupervised model is able to learn the complex spatio-temporal dynamics of multiple interacting objects in a scene and generate future frames of the video. Our model is also significantly more memory-efficient than pixel-based models and is thus able to train on videos of up to 70 frames with a single 48GB GPU. We compare our model with previous RNN-based approaches as well as other possible video transformer baselines, and demonstrate that OCVT performs well compared to these baselines in generating future frames. OCVT also develops useful representations for video reasoning, achieving state-of-the-art performance on the CATER task.
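To make the object-token idea concrete, the following is a minimal, hypothetical PyTorch sketch, not the paper's implementation: each frame is encoded into a fixed number of object latents, the per-object latents across time are flattened into a token sequence, and a causally masked transformer predicts the latents of the next step. The encoder, dimensions, class names, and masking scheme shown here are illustrative assumptions only.

```python
# Hypothetical sketch of an object-centric video transformer (not the authors' code).
# Each frame is factored into K object latents; a causally masked transformer
# models the dynamics of the resulting object-token sequence.
import torch
import torch.nn as nn

class ToyOCVT(nn.Module):
    def __init__(self, num_objects=8, latent_dim=64, num_layers=4, num_heads=4):
        super().__init__()
        self.num_objects, self.latent_dim = num_objects, latent_dim
        # Placeholder per-frame encoder: frame -> K object latents.
        # In practice this would be an unsupervised object-centric scene model.
        self.encoder = nn.Linear(3 * 64 * 64, num_objects * latent_dim)
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=num_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(latent_dim, latent_dim)  # predicts next-step object latents

    def forward(self, frames):                       # frames: (B, T, 3, 64, 64)
        B, T = frames.shape[:2]
        z = self.encoder(frames.reshape(B * T, -1))  # (B*T, K*D)
        z = z.reshape(B, T * self.num_objects, self.latent_dim)  # one token per object per frame
        # Causal mask so each token only attends to past and current tokens.
        L = z.shape[1]
        mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
        h = self.transformer(z, mask=mask)
        return self.head(h)                          # predicted latents for the next step

frames = torch.randn(2, 10, 3, 64, 64)               # 2 videos, 10 frames each
pred = ToyOCVT()(frames)
print(pred.shape)                                    # torch.Size([2, 80, 64]): 10 frames x 8 objects
```

Predicted object latents would then be decoded back to pixels to generate future frames; the decoder and the exact autoregression scheme are omitted here.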