Zero-shot learning (ZSL) aims to predict unseen classes whose samples have never appeared during training, often utilizing additional semantic information (a.k.a. side information) to bridge the training (seen) classes and the unseen classes. One of the most effective and widely used kinds of semantic information for zero-shot image classification is attributes, i.e., annotations of class-level visual characteristics. However, due to the shortage of fine-grained annotations, as well as attribute imbalance and co-occurrence, current methods often fail to discriminate the subtle visual distinctions between images, which limits their performance. In this paper, we present a transformer-based end-to-end ZSL method named DUET, which integrates latent semantic knowledge from pretrained language models (PLMs) via a self-supervised multi-modal learning paradigm. Specifically, we (1) developed a cross-modal semantic grounding network to investigate the model's capability of disentangling semantic attributes from images, (2) applied an attribute-level contrastive learning strategy to further enhance the model's discrimination of fine-grained visual characteristics against attribute co-occurrence and imbalance, and (3) proposed a multi-task learning policy for considering multi-modal objectives. With extensive experiments on three standard ZSL benchmarks and a knowledge-graph-equipped ZSL benchmark, we find that DUET can often achieve state-of-the-art performance, that its components are effective, and that its predictions are interpretable.
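To make the attribute-level contrastive objective in (2) concrete, the sketch below shows a generic InfoNCE-style loss over attribute embeddings: an anchor attribute representation is pulled toward a positive (an image sharing that attribute) and pushed away from negatives (images differing in that attribute). This is a minimal illustration of the general technique, not the authors' implementation; the function name, tensor shapes, and the `temperature` value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def attribute_contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """Generic InfoNCE-style loss over attribute-level embeddings (illustrative).

    anchor:    (D,)   embedding of an attribute in the anchor image
    positive:  (D,)   embedding of the same attribute in an image sharing it
    negatives: (K, D) embeddings from images that differ in this attribute
    """
    # Cosine similarity via L2-normalized dot products
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_sim = (anchor @ positive) / temperature      # scalar
    neg_sim = (negatives @ anchor) / temperature     # (K,)
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sim])  # (K+1,); positive at index 0
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(logits.unsqueeze(0), target)
```

Contrasting at the level of individual attributes, rather than whole images or classes, is what lets such an objective counteract attribute co-occurrence (attributes that usually appear together) and imbalance (attributes that are rare in the training data).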