Today's image prediction methods struggle to change the locations of objects in a scene, producing blurry images that average over the many positions they might occupy. In this paper, we propose a simple change to existing image similarity metrics that makes them more robust to positional errors: we match the images using optical flow, then measure the visual similarity of corresponding pixels. This change leads to crisper and more perceptually accurate predictions, and can be used with any image prediction network. We apply our method to predicting future frames of a video, where it obtains strong performance with simple, off-the-shelf architectures.
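As a rough illustration of the flow-matched metric described in the abstract, the sketch below (PyTorch, not taken from the paper) warps the target frame with a precomputed optical flow field and then compares corresponding pixels with a plain L1 distance. The function name, the use of L1, and the convention that `flow` maps prediction pixels to their matches in the target are all assumptions for illustration; the flow itself is assumed to come from any off-the-shelf estimator.

```python
import torch
import torch.nn.functional as F

def flow_matched_loss(pred, target, flow):
    """Compare `pred` to `target` after aligning them with a precomputed flow.

    pred, target: (B, C, H, W) images.
    flow: (B, 2, H, W) displacement in pixel units, assumed to map each pixel
          of `pred` to its corresponding location in `target`.
    """
    b, _, h, w = pred.shape
    # Base sampling grid, later converted to the normalized [-1, 1]
    # coordinates that grid_sample expects.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=pred.dtype, device=pred.device),
        torch.arange(w, dtype=pred.dtype, device=pred.device),
        indexing="ij",
    )
    grid_x = 2.0 * (xs + flow[:, 0]) / max(w - 1, 1) - 1.0
    grid_y = 2.0 * (ys + flow[:, 1]) / max(h - 1, 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)  # (B, H, W, 2)

    # Warp the target so each of its pixels lines up with its match in the
    # prediction, then measure per-pixel visual similarity (L1 here; any
    # per-pixel or perceptual distance could be substituted).
    target_warped = F.grid_sample(target, grid, align_corners=True)
    return (pred - target_warped).abs().mean()
```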