Zero-shot learning (ZSL) tackles the novel class recognition problem by transferring semantic knowledge from seen classes to unseen ones. Existing attention-based models learn inferior region features for a single image by solely using unidirectional attention, which ignores the transferability and discriminative attribute localization of visual features. In this paper, we propose a cross attribute-guided Transformer network, termed TransZero++, to refine visual features and learn accurate attribute localization for semantic-augmented visual embedding representations in ZSL. TransZero++ consists of an attribute$\rightarrow$visual Transformer sub-net (AVT) and a visual$\rightarrow$attribute Transformer sub-net (VAT). Specifically, AVT first employs a feature augmentation encoder to alleviate the cross-dataset problem and improve the transferability of visual features by reducing the entangled relative geometry relationships among region features. An attribute$\rightarrow$visual decoder is then employed to localize the image regions most relevant to each attribute in a given image, yielding attribute-based visual feature representations. Analogously, VAT uses a similar feature augmentation encoder to refine the visual features, which are further applied in a visual$\rightarrow$attribute decoder to learn visual-based attribute features. By further introducing semantic collaborative losses, the two attribute-guided Transformers teach each other to learn semantic-augmented visual embeddings via semantic collaborative learning. Extensive experiments show that TransZero++ achieves new state-of-the-art results on three challenging ZSL benchmarks. The codes are available at: \url{https://github.com/shiming-chen/TransZero_pp}.
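The attribute$\rightarrow$visual decoder described above can be sketched as a single cross-attention step in which learnable attribute embeddings act as queries over region features, producing one attribute-localized visual feature per attribute. This is a minimal illustrative sketch, not the paper's implementation; the function name, scaled dot-product scoring, and the shapes of `attr_embed` and `regions` are assumptions for illustration.

```python
import numpy as np

def attribute_visual_attention(attr_embed, regions):
    """Sketch of one attribute->visual cross-attention step (illustrative only).

    attr_embed: (A, d) learnable attribute query embeddings (assumed shape)
    regions:    (R, d) region features from the image backbone (assumed shape)
    Returns:    (A, d) one attribute-localized visual feature per attribute.
    """
    # Scaled dot-product scores: how relevant each region is to each attribute.
    scores = attr_embed @ regions.T / np.sqrt(regions.shape[-1])  # (A, R)
    # Numerically stable softmax over regions for each attribute.
    scores -= scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)                      # rows sum to 1
    # Each attribute's visual feature is its attention-weighted region mixture.
    return attn @ regions                                         # (A, d)
```

Because each output row is a convex combination of region features, an attribute whose query aligns strongly with one region effectively "localizes" that region, which is the intuition behind the attribute-based visual representations.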