Features obtained from object recognition CNNs have been widely used for measuring perceptual similarities between images. Such differentiable metrics can be used as perceptual learning losses to train image enhancement models. However, the choice of the distance function between input and target features can have a significant impact on the performance of the trained model. While using the norm of the difference between extracted features leads to limited hallucination of details, measuring the distance between distributions of features may generate more textures, but also more unrealistic details and artifacts. In this paper, we demonstrate that aggregating 1D-Wasserstein distances between CNN activations is more reliable than the existing approaches, and that it can significantly improve the perceptual performance of enhancement models. More specifically, we show that in imaging applications such as denoising, super-resolution, demosaicing, deblurring, and JPEG artifact removal, the proposed learning loss outperforms current state-of-the-art reference-based perceptual losses. This means that the proposed learning loss can be plugged into different imaging frameworks and produce perceptually realistic results.
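The core idea can be sketched in a few lines. For two 1D empirical distributions with the same number of samples, the p-Wasserstein distance reduces to the Lp distance between the sorted samples; the loss then aggregates these per-channel distances across CNN layers. The snippet below is a minimal NumPy illustration, not the authors' implementation: the feature tensors, the uniform mean aggregation, and the function names are assumptions for the sake of the example.

```python
import numpy as np

def wasserstein_1d(x, y, p=1):
    """1D p-Wasserstein distance between two equal-size empirical
    distributions: the Lp distance between the sorted samples."""
    xs, ys = np.sort(x), np.sort(y)
    return np.mean(np.abs(xs - ys) ** p) ** (1.0 / p)

def perceptual_wasserstein_loss(feats_a, feats_b):
    """Aggregate 1D-Wasserstein distances over per-channel activation
    distributions from several CNN layers.  The plain mean over all
    channels and layers is a hypothetical aggregation choice."""
    total, count = 0.0, 0
    for fa, fb in zip(feats_a, feats_b):  # one (C, H, W) array per layer
        for c in range(fa.shape[0]):      # each channel is one 1D distribution
            total += wasserstein_1d(fa[c].ravel(), fb[c].ravel())
            count += 1
    return total / count
```

In a training setting, the same computation would be done with differentiable tensor operations (sorting is differentiable almost everywhere), so the loss can be backpropagated through the enhancement network.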