Video prediction and generation are notoriously difficult tasks, with research in this area largely limited to short-term predictions. Though plagued by noise and stochasticity, videos consist of features organised in a spatiotemporal hierarchy, with different features possessing different temporal dynamics. In this paper, we introduce Dynamic Latent Hierarchy (DLH) -- a deep hierarchical latent model that represents videos as a hierarchy of latent states evolving over separate and fluid timescales. Each latent state is a mixture distribution with two components, representing the immediate past and the predicted future, causing the model to learn transitions only between sufficiently dissimilar states, while clustering temporally persistent states closer together. Using this unique property, DLH naturally discovers the spatiotemporal structure of a dataset and learns disentangled representations across its hierarchy. We hypothesise that this simplifies the task of modelling the temporal dynamics of a video, improves the learning of long-term dependencies, and reduces error accumulation. As evidence, we demonstrate that DLH outperforms state-of-the-art benchmarks in video prediction, better represents stochasticity, and dynamically adjusts its hierarchical and temporal structure. Our paper shows, among other things, how progress in representation learning can translate into progress in prediction tasks.
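The two-component mixture mechanism described above can be sketched minimally as follows. Everything here is an illustrative assumption rather than the paper's exact parameterisation: we model a latent state as a Bernoulli-gated mixture of two Gaussian components, one centred on the immediate past state and one on the predicted future state, so that sampling either persists in the past state or transitions to the prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu_past, mu_pred, sigma, w_transition):
    """Hypothetical sketch of a two-component mixture latent state.

    Component 0 is centred on the immediate past (mu_past),
    component 1 on the predicted future (mu_pred). With probability
    w_transition the state transitions; otherwise it persists.
    The Gaussian form and the scalar gate are assumptions for
    illustration, not DLH's actual inference procedure.
    """
    transition = rng.random() < w_transition
    mu = mu_pred if transition else mu_past
    return mu + sigma * rng.standard_normal(mu.shape)
```

Intuitively, when the past and predicted components are close (a temporally persistent feature), either mixture component yields nearly the same state, so the level effectively idles; only sufficiently dissimilar predictions produce a meaningful transition.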