Multi-label few-shot image classification (ML-FSIC) is the task of assigning descriptive labels to previously unseen images, based on a small number of training examples. A key feature of the multi-label setting is that images often have multiple labels, which typically refer to different regions of the image. When estimating prototypes in a metric-based setting, it is thus important to determine which regions are relevant for which labels, but the limited amount of training data makes this highly challenging. As a solution, in this paper we propose to use word embeddings as a form of prior knowledge about the meaning of the labels. In particular, visual prototypes are obtained by aggregating the local feature maps of the support images, using an attention mechanism that relies on the label embeddings. As an important advantage, our model can infer prototypes for unseen labels without the need for fine-tuning any model parameters, which demonstrates its strong generalization ability. Experiments on COCO and PASCAL VOC furthermore show that our model substantially improves the current state-of-the-art.
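The label-guided aggregation described above can be illustrated with a minimal sketch: a word embedding of the label is projected into the visual feature space, attention scores are computed against all local feature-map positions of the support images, and the prototype is the attention-weighted average of those local features. All names here (`proj`, dimensions, the dot-product scoring) are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def label_guided_prototype(feature_maps, label_embedding, proj):
    """Aggregate local support features into a prototype for one label.

    feature_maps    : (n_support, n_regions, d_vis) local features of the
                      support images (e.g. flattened CNN feature maps).
    label_embedding : (d_word,) word embedding of the label.
    proj            : (d_word, d_vis) hypothetical learned projection that
                      maps word embeddings into the visual feature space.
    """
    query = label_embedding @ proj                # (d_vis,) label query vector
    flat = feature_maps.reshape(-1, feature_maps.shape[-1])  # pool all regions
    scores = flat @ query                         # attention logits per region
    weights = softmax(scores)                     # normalize over all regions
    prototype = weights @ flat                    # (d_vis,) weighted aggregate
    return prototype

# Toy usage with random data: 3 support images, 7x7 feature maps, 64-d features,
# 300-d word embeddings (GloVe-like dimensionality).
rng = np.random.default_rng(0)
fm = rng.normal(size=(3, 49, 64))
emb = rng.normal(size=(300,))
P = rng.normal(size=(300, 64))
proto = label_guided_prototype(fm, emb, P)
```

Because the prototype is computed purely from the support features and a (frozen) label embedding, a prototype for a previously unseen label can be produced by the same forward pass, with no parameter fine-tuning.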