The design of image and video quality assessment (QA) algorithms is essential for benchmarking and calibrating user experience in modern visual systems. A major drawback of state-of-the-art QA methods is their limited ability to generalize across diverse image and video datasets that exhibit reasonable distribution shifts. In this work, we leverage the denoising process of diffusion models for generalized image QA (IQA) and video QA (VQA) by measuring the degree of alignment between learnable quality-aware text prompts and images or video frames. In particular, we learn cross-attention maps from intermediate layers of the denoiser of latent diffusion models (LDMs) to capture quality-aware representations of images or video frames. Since applying text-to-image LDMs to every video frame is computationally expensive, we estimate quality only on a frame-rate sub-sampled version of the original video. To compensate for the loss of motion information caused by frame-rate sub-sampling, we propose a novel temporal quality modulator. Extensive cross-database experiments on user-generated, synthetic, low-light, frame-rate-varied, ultra-high-definition, and streaming content databases show that our model achieves superior generalization in both IQA and VQA.
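To make the prompt-to-frame alignment idea concrete, the following is a minimal, self-contained sketch of a cross-attention map between spatial features of a frame and learnable quality-aware text tokens. All names (`cross_attention_map`, the toy dimensions, the two-token "good/bad" prompt) are illustrative assumptions, not the paper's actual architecture; in the real method these maps come from intermediate layers of an LDM denoiser.

```python
import numpy as np

def cross_attention_map(frame_feats, prompt_embeds):
    """Softmax cross-attention of spatial frame tokens over text-prompt tokens.

    frame_feats:   (N, d) spatial tokens, e.g. from an intermediate denoiser layer
    prompt_embeds: (T, d) learnable quality-aware text tokens (hypothetical)
    Returns an (N, T) map: each row distributes attention over the prompt tokens.
    """
    d = frame_feats.shape[-1]
    scores = frame_feats @ prompt_embeds.T / np.sqrt(d)   # scaled dot products
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)              # softmax over prompts
    return attn

rng = np.random.default_rng(0)
frames = rng.standard_normal((64, 32))   # toy 8x8 latent grid, d = 32
prompts = rng.standard_normal((2, 32))   # toy "good photo" / "bad photo" tokens
A = cross_attention_map(frames, prompts)
quality = float(A[:, 0].mean())          # mean alignment with the "good" prompt
print(A.shape, round(quality, 3))
```

Pooling the attention assigned to the "good" token over all spatial positions yields a scalar that can serve as a crude quality proxy; the paper instead learns quality-aware representations from such maps.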