Video quality assessment (VQA) aims to simulate human perception of video quality, which is influenced by factors ranging from low-level color and texture details to high-level semantic content. To effectively model these complicated quality-related factors, in this paper, we decompose video into three levels (\ie, patch level, frame level, and clip level), and propose a novel architecture, Zoom-VQA, to perceive spatio-temporal features at each level. It integrates three components: a patch attention module, frame pyramid alignment, and a clip ensemble strategy, which respectively capture regions of interest in the spatial dimension, multi-scale information across feature levels, and distortions distributed over the temporal dimension. Owing to this comprehensive design, Zoom-VQA obtains state-of-the-art results on four VQA benchmarks and achieves 2nd place in the NTIRE 2023 VQA challenge. Notably, Zoom-VQA outperforms the previous best results on two subsets of LSVQ, achieving SRCC of 0.8860 (+1.0%) and 0.7985 (+1.9%) on the respective subsets. Extensive ablation studies further verify the effectiveness of each component. Codes and models are released at https://github.com/k-zha14/Zoom-VQA.
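As a rough illustration of the three-level idea (the paper does not include this snippet; the module name \texttt{PatchAttentionPool} and the ensemble logic below are hypothetical simplifications, not the authors' exact implementation), a patch attention module can be sketched as attention-weighted pooling over patch features, and the clip ensemble as averaging per-clip scores; SRCC, the metric reported above, is Spearman's rank correlation:

\begin{verbatim}
import torch
import torch.nn as nn
from scipy.stats import spearmanr


class PatchAttentionPool(nn.Module):
    """Hypothetical sketch: attention-weighted pooling over patch features.

    Scores each patch embedding, softmax-normalizes the scores over
    patches, and returns the weighted sum, so salient regions dominate
    the frame-level representation.
    """

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, patches):                        # patches: (B, N, dim)
        w = torch.softmax(self.score(patches), dim=1)  # weights: (B, N, 1)
        return (w * patches).sum(dim=1)                # pooled:  (B, dim)


pool = PatchAttentionPool(dim=64)
frame_feat = pool(torch.randn(2, 49, 64))  # e.g., 49 patches per frame

# Clip ensemble: average per-clip predictions into one video-level score.
clip_scores = torch.tensor([3.9, 4.1, 4.0])  # toy per-clip predictions
video_score = clip_scores.mean()

# SRCC (Spearman rank correlation), the metric reported on LSVQ.
pred = [video_score.item(), 2.5, 4.6]
mos = [4.0, 2.7, 4.5]
srcc, _ = spearmanr(pred, mos)
print(f"SRCC = {srcc:.4f}")
\end{verbatim}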