Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms. Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner. Distortion type identification and degradation level determination are employed as auxiliary tasks to train a deep learning model comprising a Convolutional Neural Network (CNN) that extracts spatial features and a recurrent unit that captures temporal information. The model is trained using a contrastive loss, and we therefore refer to this training framework and the resulting model as CONtrastive VIdeo Quality EstimaTor (CONVIQT). During testing, the weights of the trained model are frozen, and a linear regressor maps the learned features to quality scores in a no-reference (NR) setting. We conduct comprehensive evaluations of the proposed model on multiple VQA databases by analyzing the correlations between model predictions and ground-truth quality ratings, and achieve competitive performance when compared with state-of-the-art NR-VQA models, even though CONVIQT is not trained on those databases. Our ablation experiments demonstrate that the learned representations are highly robust and generalize well across synthetic and realistic distortions. Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning. The implementations used in this work have been made available at https://github.com/pavancm/CONVIQT.
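To make the training pipeline described above concrete, the following is a minimal PyTorch sketch of a contrastively trained spatial-plus-temporal encoder. It is not the authors' implementation (see the linked repository for that); the ResNet-18 backbone, GRU size, projection head, and SimCLR-style NT-Xent loss are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class QualityEncoder(nn.Module):
    """A CNN extracts per-frame spatial features; a GRU aggregates them over time."""
    def __init__(self, feat_dim=512, hidden_dim=256, proj_dim=128):
        super().__init__()
        backbone = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier head
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.proj = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                  nn.Linear(hidden_dim, proj_dim))

    def forward(self, clips):                             # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).flatten(1)  # (B*T, feat_dim)
        _, h = self.gru(feats.view(b, t, -1))             # final hidden state: (1, B, hidden_dim)
        return self.proj(h.squeeze(0))                    # (B, proj_dim)

def ntxent_loss(z1, z2, tau=0.1):
    """NT-Xent loss: two views of the same clip are positives, all others negatives."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    sim = (z @ z.t()) / tau
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))            # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Smoke test with random stand-ins for two distortion-augmented views of a batch.
enc = QualityEncoder()
v1, v2 = torch.randn(2, 4, 8, 3, 64, 64)   # each view: 4 clips of 8 frames
loss = ntxent_loss(enc(v1), enc(v2))
```

In the actual framework, the two views would be produced by applying different synthetic distortion types and degradation levels to the same source clip, so that the auxiliary task of distinguishing distortions shapes the learned representation.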
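The frozen-feature evaluation protocol can likewise be sketched in a few lines. The ridge regressor, placeholder data, and train/test split below are assumptions for illustration, not the paper's exact setup; only the overall recipe (frozen features, a linear map to quality scores, SRCC/PLCC against human ratings) follows the abstract.

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr
from sklearn.linear_model import Ridge

# Placeholder data standing in for frozen CONVIQT features and MOS labels.
rng = np.random.default_rng(0)
train_feats, test_feats = rng.normal(size=(200, 256)), rng.normal(size=(50, 256))
train_mos, test_mos = rng.uniform(0, 100, 200), rng.uniform(0, 100, 50)

# Encoder weights stay frozen; only this linear regressor is fit on each database.
regressor = Ridge(alpha=1.0).fit(train_feats, train_mos)
pred = regressor.predict(test_feats)
print("SRCC:", spearmanr(pred, test_mos)[0])  # rank correlation with human ratings
print("PLCC:", pearsonr(pred, test_mos)[0])   # linear correlation with human ratings
```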