An agent that is capable of predicting what happens next can perform a variety of tasks through planning with no additional training. Furthermore, such an agent can internally represent the complex dynamics of the real world and therefore can acquire a representation useful for a variety of visual perception tasks. This makes predicting the future frames of a video, conditioned on the observed past and potentially on future actions, an interesting task that remains exceptionally challenging despite many recent advances. Existing video prediction models have shown promising results on simple, narrow benchmarks, but they generate low-quality predictions on real-life datasets with more complicated dynamics or a broader domain. There is a growing body of evidence that underfitting on the training data is one of the primary causes of these low-quality predictions. In this paper, we argue that the inefficient use of parameters in current video models is the main reason for underfitting. Therefore, we introduce a new architecture, named FitVid, which is capable of severe overfitting on the common benchmarks while having a parameter count similar to current state-of-the-art models. We analyze the consequences of overfitting, illustrating how it can produce unexpected outcomes such as generating high-quality output by repeating the training data, and how it can be mitigated using existing image augmentation techniques. As a result, FitVid outperforms the current state-of-the-art models across four different video prediction benchmarks on four different metrics.
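The abstract mentions mitigating overfitting with existing image augmentation techniques. As a minimal sketch (an assumption about the general approach, not the paper's exact recipe), one way to apply image-style augmentation to video prediction data is to sample a single random flip and crop per clip and apply it identically to every frame, so temporal dynamics are preserved; the function name and crop size below are illustrative.

```python
import numpy as np

def augment_video(video, rng, crop=56):
    """Augment one video clip consistently across time.

    video: (T, H, W, C) array; rng: np.random.Generator.
    Returns a (T, crop, crop, C) clip where the same flip decision
    and crop offsets are shared by all frames.
    """
    t, h, w, c = video.shape
    if rng.random() < 0.5:                  # horizontal flip, sampled once per clip
        video = video[:, :, ::-1, :]
    y = rng.integers(0, h - crop + 1)       # shared spatial crop offsets
    x = rng.integers(0, w - crop + 1)
    return video[:, y:y + crop, x:x + crop, :]
```

Sampling the augmentation once per clip, rather than per frame, is the key design choice: per-frame randomness would corrupt the very motion signal the prediction model is trying to learn.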