Existing computer vision research on artwork struggles with fine-grained attribute recognition and with the scarcity of curated annotated datasets, which are costly to create. To the best of our knowledge, ours is one of the first methods to use CLIP (Contrastive Language-Image Pre-training) to train a neural network on pairs of artwork images and text descriptions. CLIP is able to learn directly from free-form art descriptions or, when available, from curated fine-grained labels. The model's zero-shot capability allows it to predict an accurate natural-language description for a given image without being directly optimized for the task. Our approach addresses two challenges: instance retrieval and fine-grained artwork attribute recognition. We use the iMet Dataset, which we consider the largest annotated artwork dataset. On this benchmark we achieve competitive results using only self-supervision.
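To make the zero-shot attribute-recognition setting concrete, the following is a minimal sketch of CLIP zero-shot classification using the open-source openai/clip package. The image path and the candidate labels are illustrative placeholders, not taken from the iMet taxonomy or from our method's actual prompt set.

```python
import torch
import clip
from PIL import Image

# Load a pre-trained CLIP model and its matching image preprocessing pipeline.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate attribute labels (hypothetical examples, not the iMet label set).
labels = ["an oil painting", "a marble sculpture", "a woodblock print"]
text = clip.tokenize([f"a photo of {label}" for label in labels]).to(device)

# Encode the query image (placeholder path).
image = preprocess(Image.open("artwork.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Cosine similarity between the image embedding and each label embedding,
    # converted into a probability distribution over the candidate labels.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```

Because the labels are supplied as free-form text at inference time, the same mechanism covers both curated fine-grained labels and natural-language descriptions, without retraining for a fixed label set.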