Generalized Zero-Shot Learning (GZSL) identifies unseen categories through knowledge transferred from the seen domain, relying on the intrinsic interactions between visual and semantic information. Prior works mainly localize regions corresponding to the shared attributes. When various visual appearances correspond to the same attribute, the shared attributes inevitably introduce semantic ambiguity, hampering the exploration of accurate semantic-visual interactions. In this paper, we deploy a dual semantic-visual transformer module (DSVTM) to progressively model the correspondences between attribute prototypes and visual features, constituting a progressive semantic-visual mutual adaption (PSVMA) network for semantic disambiguation and improved knowledge transferability. Specifically, DSVTM devises an instance-motivated semantic encoder that learns instance-centric prototypes adapted to different images, recasting unmatched semantic-visual pairs into matched ones. A semantic-motivated instance decoder then strengthens accurate cross-domain interactions between the matched pairs for semantic-related instance adaption, encouraging the generation of unambiguous visual representations. Moreover, to mitigate the bias towards seen classes in GZSL, a debiasing loss is proposed to pursue response consistency between seen and unseen predictions. PSVMA consistently yields superior performance over other state-of-the-art methods. Code will be available at: https://github.com/ManLiuCoder/PSVMA.
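To make the encoder-decoder interplay concrete, below is a minimal PyTorch sketch of one DSVTM-style round. The class name, dimensions, and single-stage structure are illustrative assumptions; the actual module applies such stages progressively and contains components not reproduced here.

```python
# A minimal sketch of one DSVTM-style encoder/decoder round.
# DSVTMSketch and all dimensions are hypothetical, not the paper's code.
import torch
import torch.nn as nn

class DSVTMSketch(nn.Module):
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        # Instance-motivated semantic encoder: attribute prototypes (queries)
        # attend to visual patch features, yielding instance-centric prototypes.
        self.semantic_enc = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Semantic-motivated instance decoder: patch features (queries) attend
        # to the adapted prototypes for semantic-related instance adaption.
        self.instance_dec = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm_p = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, prototypes, patches):
        # prototypes: (B, A, D) shared attribute prototypes
        # patches:    (B, N, D) visual patch features
        # 1) adapt prototypes to this instance, recasting the unmatched
        #    semantic-visual pair into a matched one
        adapted, _ = self.semantic_enc(prototypes, patches, patches)
        adapted = self.norm_p(prototypes + adapted)
        # 2) refine visual features against the matched prototypes,
        #    encouraging unambiguous visual representations
        refined, _ = self.instance_dec(patches, adapted, adapted)
        refined = self.norm_v(patches + refined)
        return adapted, refined
```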
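The debiasing objective can be sketched in the same spirit. The statistic-matching form below is one plausible reading of "response consistency between seen and unseen predictions", assuming logits over the union of seen and unseen classes; it is an assumption for illustration, not the paper's exact loss.

```python
# A hedged sketch of a debiasing loss pursuing response consistency
# between seen and unseen predictions; the mean/variance matching below
# is an assumed instantiation, not necessarily the paper's formulation.
import torch

def debiasing_loss(logits: torch.Tensor, seen_mask: torch.Tensor) -> torch.Tensor:
    # logits:    (B, C) class scores over all (seen + unseen) classes
    # seen_mask: (C,) boolean, True for seen classes
    seen = logits[:, seen_mask]
    unseen = logits[:, ~seen_mask]
    # penalize gaps in response magnitude and spread across the two partitions,
    # discouraging systematically larger scores on seen classes
    mean_gap = (seen.mean(dim=1) - unseen.mean(dim=1)).pow(2)
    var_gap = (seen.var(dim=1) - unseen.var(dim=1)).pow(2)
    return (mean_gap + var_gap).mean()
```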