Video prediction commonly refers to forecasting the future frames of a video sequence given several of its past frames. It remains a challenging domain, as visual scenes evolve according to complex underlying dynamics, such as the camera's egocentric motion or the distinct motion of each object in view. These dynamics are mostly hidden from the observer and manifest as often highly non-linear transformations between consecutive video frames. Video prediction is therefore of interest not only for anticipating visual changes in the real world, but has, above all, emerged as an unsupervised learning task that targets the structure and dynamics of the observed environment. Many state-of-the-art deep learning models for video prediction employ some form of recurrent layer, such as Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) layers, at their core. Although these models can predict future frames, they rely entirely on these recurrent structures to simultaneously perform three distinct tasks: extracting transformations, projecting them into the future, and transforming the current frame. Fully interpreting the learned internal representations requires disentangling these tasks. This paper proposes a fully differentiable building block that performs each of these tasks separately while maintaining interpretability. We derive the relevant theoretical foundations and showcase results on synthetic as well as real data. We demonstrate that our method is readily extended to perform motion segmentation and account for the scene's composition, and that it learns to produce reliable predictions in an entirely interpretable manner by observing only unlabeled video data.
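As a minimal formal sketch of this separation (our illustrative notation; the maps f_enc, f_dyn, and f_apply are hypothetical placeholders, not taken from the paper), the three tasks can be written as

\[
T_t = f_{\mathrm{enc}}(x_{t-1}, x_t), \qquad
\hat{T}_{t+1} = f_{\mathrm{dyn}}(T_t, T_{t-1}, \ldots), \qquad
\hat{x}_{t+1} = f_{\mathrm{apply}}(x_t, \hat{T}_{t+1}),
\]

where x_t denotes the current frame, T_t the transformation extracted from the two most recent frames, \hat{T}_{t+1} its extrapolation into the future, and \hat{x}_{t+1} the predicted next frame. In a conventional recurrent predictor, all three maps are entangled within a single LSTM/GRU state update, whereas the disentangled formulation keeps them as separate, individually inspectable, fully differentiable operations.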