In recent years, a growing number of pre-trained models, trained on large corpora of data, have achieved strong performance on a variety of tasks, including the classification of multimodal datasets. These models perform well on natural images but remain underexplored for scarce, abstract concepts in images. In this work, we introduce an image/text dataset called the Greeting Cards Dataset (GCD) that contains abstract visual concepts. We propose to aggregate features from pre-trained image and text embeddings to learn abstract visual concepts from GCD. This allows us to learn text-modified image features, which combine the complementary and redundant information from the multi-modal data streams into a single, meaningful feature. Second, the captions for the GCD dataset are generated with a pre-trained CLIP-based image captioning model. Finally, we demonstrate that the proposed dataset is also useful for generating greeting card images with a pre-trained text-to-image generation model.