Aesthetic assessment of images can be categorized into two main forms: numerical assessment and language assessment. Aesthetic captioning of photographs is the only aesthetic language assessment task that has been addressed so far. In this paper, we propose a new task of aesthetic language assessment: aesthetic visual question answering (AVQA) of images. Given a question about the aesthetics of an image, a model predicts the answer. We use images from \textit{www.flickr.com}. The objective QA pairs are generated by the proposed aesthetic attribute analysis algorithms. Moreover, we introduce subjective QA pairs that are converted from aesthetic numerical labels and sentiment analysis from large-scale pre-trained models. We build the first aesthetic visual question answering dataset, AesVQA, which contains 72,168 high-quality images and 324,756 aesthetic question-answer pairs. We propose two methods for adjusting the data distribution and show that they improve the accuracy of existing models. This is the first work that both addresses the task of aesthetic VQA and introduces subjectivity into VQA tasks. The experimental results show that our methods outperform other VQA models on this new task.