Measuring the perception of visual content is a long-standing problem in computer vision. Many mathematical models have been developed to evaluate the look or quality of an image. While such tools are effective in quantifying degradations such as noise and blurriness levels, this quantification is only loosely coupled with human language. When it comes to the more abstract perception of the feel of visual content, existing methods can only rely on supervised models that are explicitly trained with labeled data collected via laborious user studies. In this paper, we go beyond the conventional paradigms by exploring the rich visual language prior encapsulated in Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner. In particular, we discuss effective prompt designs and present an effective prompt pairing strategy to harness the prior. We also provide extensive experiments on controlled datasets and Image Quality Assessment (IQA) benchmarks. Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments. Code is available at https://github.com/IceClear/CLIP-IQA.
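To make the prompt pairing idea concrete, the following is a minimal sketch of how an antonym prompt pair can be scored with an off-the-shelf CLIP model. The specific prompt texts ("Good photo." / "Bad photo."), the openai/CLIP `clip` package, the ViT-B/32 backbone, and the image path are illustrative assumptions for this sketch and do not necessarily match the released CLIP-IQA implementation.

```python
# A minimal sketch of antonym prompt pairing with CLIP for zero-shot
# quality scoring. The prompts, backbone, and file path below are
# illustrative assumptions, not the exact setup of the released code.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Antonym prompt pair: pairing a prompt with its opposite reduces the
# ambiguity of scoring a single prompt in isolation.
prompts = clip.tokenize(["Good photo.", "Bad photo."]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)

    # Cosine similarity between the image and each prompt.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_feat @ text_feat.t()

    # Softmax over the pair; the probability assigned to the positive
    # prompt serves as the zero-shot perceptual score in [0, 1].
    score = logits.softmax(dim=-1)[0, 0].item()

print(f"Predicted quality score: {score:.3f}")
```

The same pairing scheme extends to abstract "feel" attributes by swapping in other antonym pairs (e.g., descriptors of mood or aesthetics) without any task-specific training.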