While it is nearly effortless for humans to quickly assess the perceptual similarity between two images, the underlying processes are thought to be quite complex. Despite this, the most widely used perceptual metrics today, such as PSNR and SSIM, are simple, shallow functions that fail to account for many nuances of human perception. Recently, the deep learning community has found that features of the VGG network trained on the ImageNet classification task have been remarkably useful as a training loss for image synthesis. But how perceptual are these so-called "perceptual losses"? What elements are critical for their success? To answer these questions, we introduce a new Full Reference Image Quality Assessment (FR-IQA) dataset of perceptual human judgments, orders of magnitude larger than previous datasets. We systematically evaluate deep features across different architectures and tasks and compare them with classic metrics. We find that deep features outperform all previous metrics by large margins. More surprisingly, this result is not restricted to ImageNet-trained VGG features, but holds across different deep architectures and levels of supervision (supervised, self-supervised, or even unsupervised). Our results suggest that perceptual similarity is an emergent property shared across deep visual representations.
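As a rough illustration of how deep features can serve as a perceptual distance, the sketch below compares unit-normalized VGG-16 activations at a few layers between a reference image and a distorted copy. The layer choice, equal layer weighting, and input preprocessing here are illustrative assumptions, not the paper's exact configuration (e.g., ImageNet-standard input normalization and learned per-channel weights are omitted for brevity).

```python
import torch
import torchvision.models as models

# ImageNet-pretrained VGG-16 feature extractor from torchvision.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()

# Indices of a few convolutional activations to compare (hypothetical choice).
LAYER_IDS = {3, 8, 15, 22}

def vgg_features(x):
    """Collect activations at the chosen layers for a batch of images."""
    feats = []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in LAYER_IDS:
            # Unit-normalize each spatial position across channels so that
            # layers with different magnitudes contribute comparably.
            feats.append(x / (x.norm(dim=1, keepdim=True) + 1e-8))
    return feats

def perceptual_distance(img_a, img_b):
    """Mean squared difference of normalized deep features, averaged over layers."""
    with torch.no_grad():
        fa, fb = vgg_features(img_a), vgg_features(img_b)
    return sum(((a - b) ** 2).mean() for a, b in zip(fa, fb)) / len(fa)

# Usage: a random 224x224 RGB image stands in for a reference; a noisy copy
# stands in for a distortion. Larger values indicate lower perceptual similarity.
ref = torch.rand(1, 3, 224, 224)
distorted = ref + 0.05 * torch.randn_like(ref)
print(perceptual_distance(ref, distorted).item())
```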