Measuring the perception of visual content is a long-standing problem in computer vision. Many mathematical models have been developed to evaluate the look or quality of an image. Although such tools are effective at quantifying degradations such as noise and blur levels, this quantification is only loosely coupled with human language. When it comes to more abstract perception of the feel of visual content, existing methods can only rely on supervised models that are explicitly trained with labeled data collected via laborious user studies. In this paper, we go beyond the conventional paradigms by exploring the rich visual language prior encapsulated in Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner. In particular, we discuss effective prompt designs and present an effective prompt pairing strategy to harness the prior. We also provide extensive experiments on controlled datasets and Image Quality Assessment (IQA) benchmarks. Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments. Code will be available at https://github.com/IceClear/CLIP-IQA.
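To make the prompt pairing strategy concrete, below is a minimal sketch of the idea: an image is compared against an antonym prompt pair (e.g., "Good photo." / "Bad photo.") and a softmax over the two image-text similarities yields a relative score. This is an illustration only, not the authors' exact implementation (which further modifies CLIP itself); it assumes the Hugging Face `transformers` CLIP API and a hypothetical helper name `clip_iqa_score`.

```python
# Sketch of antonym prompt pairing with CLIP for zero-shot quality scoring.
# Assumptions: Hugging Face `transformers` CLIP API; this is not the official CLIP-IQA code.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_iqa_score(image: Image.Image,
                   prompt_pair=("Good photo.", "Bad photo.")) -> float:
    """Return a 0-1 score: softmax over the image's similarities to an antonym prompt pair."""
    inputs = processor(text=list(prompt_pair), images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, 2): scaled image-text similarities
    probs = logits.softmax(dim=-1)                 # relative preference between the two prompts
    return probs[0, 0].item()                      # probability assigned to the positive prompt

# Usage: score = clip_iqa_score(Image.open("photo.jpg"))
```

Pairing antonym prompts and normalizing over the pair removes the need for an absolute similarity threshold, which is what allows the assessment to remain zero-shot; swapping in pairs such as "Happy photo." / "Sad photo." extends the same mechanism to abstract (feel) attributes.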