The use of Deep Learning and Computer Vision in the Cultural Heritage domain has become highly relevant in recent years, with many applications such as smart audio guides, interactive museums, and augmented reality. All these technologies require large amounts of data to work effectively and be useful to the user. In the context of artworks, such data is annotated by experts in an expensive and time-consuming process. In particular, for each artwork, an image and a description sheet have to be collected in order to perform common tasks such as Visual Question Answering. In this paper we propose a method for Visual Question Answering that generates a description sheet at runtime, which can be used to answer both visual and contextual questions about the artwork, completely avoiding the need for the image and the annotation process. For this purpose, we investigate the use of GPT-3 for generating artwork descriptions and analyze the quality of the generated descriptions through captioning metrics. Finally, we evaluate the performance on Visual Question Answering and captioning tasks.
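The following is a minimal sketch of the two-step idea summarized above, assuming the legacy openai Python SDK (< 1.0), a GPT-3 family model such as text-davinci-002, and hypothetical artwork metadata; it is an illustration of the general approach, not the paper's exact pipeline. The first call asks GPT-3 for a description sheet of the artwork at runtime, and the second answers a visual or contextual question using only that generated sheet as context.

```python
# Sketch only: assumes legacy openai SDK (< 1.0) and the GPT-3 model name below.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]
MODEL = "text-davinci-002"  # GPT-3 family model (assumption)


def generate_description(title: str, author: str) -> str:
    """Generate a description sheet for the artwork at runtime, no image or expert annotation needed."""
    prompt = (
        f"Write a detailed description of the artwork '{title}' by {author}, "
        f"covering its visual content and historical context."
    )
    resp = openai.Completion.create(model=MODEL, prompt=prompt, max_tokens=256)
    return resp["choices"][0]["text"].strip()


def answer_question(description: str, question: str) -> str:
    """Answer a visual or contextual question using only the generated description sheet."""
    prompt = f"Context: {description}\n\nQuestion: {question}\nAnswer:"
    resp = openai.Completion.create(model=MODEL, prompt=prompt, max_tokens=64)
    return resp["choices"][0]["text"].strip()


if __name__ == "__main__":
    sheet = generate_description("Mona Lisa", "Leonardo da Vinci")
    print(answer_question(sheet, "What is the subject of the painting?"))
```

In this setup the generated sheet plays the role of the expert-written description, so the same text can also be scored against reference descriptions with standard captioning metrics.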