The study of video prediction models is believed to be a fundamental approach to representation learning for videos. While a plethora of generative models exist for predicting future frame pixel values given the past few frames, the quantitative evaluation of the predicted frames has been found to be extremely challenging. In this context, we introduce the problem of naturalness evaluation, which refers to how natural or realistic a predicted video looks. We create the Indian Institute of Science VIdeo Naturalness Evaluation (IISc VINE) Database, consisting of 300 videos obtained by applying different prediction models to different datasets, together with accompanying human opinion scores. We collected subjective ratings of naturalness from 50 human participants for these videos. Our subjective study reveals that human observers were highly consistent in their judgments of naturalness. We benchmark several widely used measures for evaluating video prediction and show that they do not adequately correlate with these subjective scores. We introduce two new features to effectively capture naturalness: motion-compensated cosine similarities of deep features of predicted frames with past frames, and deep features extracted from rescaled frame differences. We show that our feature design leads to state-of-the-art naturalness prediction in accordance with human judgments on our IISc VINE Database. The database and code are publicly available on our project website: https://nagabhushansn95.github.io/publications/2020/vine
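The two building blocks named above, a cosine similarity between feature arrays and a rescaled frame difference, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, motion compensation and the deep feature extractor are omitted, and min-max rescaling to [0, 1] is an assumed scheme, since the abstract does not specify how frame differences are rescaled.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two feature arrays, flattened to vectors.

    In the setting described above, `a` and `b` would be deep features of a
    predicted frame and a past frame; here they are plain arrays.
    """
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    # eps guards against division by zero for all-zero feature vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def rescaled_frame_difference(prev_frame, next_frame):
    """Difference of two frames, min-max rescaled to [0, 1].

    Assumption: min-max rescaling is used purely for illustration.
    """
    diff = np.asarray(next_frame, dtype=np.float64) - np.asarray(
        prev_frame, dtype=np.float64
    )
    span = diff.max() - diff.min()
    if span == 0:
        # Identical frames (or a constant difference) carry no structure.
        return np.zeros_like(diff)
    return (diff - diff.min()) / span


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random arrays stand in for deep features and video frames.
    feat_past = rng.standard_normal((8, 4, 4))
    feat_pred = rng.standard_normal((8, 4, 4))
    print("feature similarity:", cosine_similarity(feat_past, feat_pred))

    frame_a = rng.integers(0, 256, size=(4, 4))
    frame_b = rng.integers(0, 256, size=(4, 4))
    print("rescaled difference:\n", rescaled_frame_difference(frame_a, frame_b))
```

In a fuller pipeline, the rescaled difference would presumably be passed through a pretrained network to obtain the deep features, and the similarity would be computed only after aligning the past-frame features to the predicted frame.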