Generalized zero-shot learning (GZSL) is a technique for training a deep learning model to identify unseen classes using image attributes. In this paper, we put forth a new GZSL approach that exploits the Vision Transformer (ViT) to maximize the attribute-related information contained in the image feature. ViT processes the entire image region without degrading the image resolution, and local image information is preserved in the patch features. To fully exploit these benefits, we use the patch features as well as the CLS feature when extracting the attribute-related image feature. In particular, we propose a novel attention-based module, called the attribute attention module (AAM), to aggregate the attribute-related information in the patch features. In AAM, the correlation between each patch feature and the synthetic image attribute is used as the importance weight for that patch. Through extensive experiments on benchmark datasets, we demonstrate that the proposed technique outperforms state-of-the-art GZSL approaches by a large margin.
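The patch weighting described for AAM can be illustrated with a small sketch. The abstract does not specify the exact form of the module, so the following is only one plausible reading: it assumes the correlation is a scaled dot product, that the weights are normalized with a softmax, and that the synthetic image attribute has already been projected into the patch feature space. The function name `attribute_attention` and the toy inputs are hypothetical and not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attribute_attention(patch_features, attribute_query):
    """Pool patch features, weighting each patch by its correlation
    with the attribute query.

    patch_features: list of N patch vectors, each of length d.
    attribute_query: vector of length d (assumption: the synthetic image
        attribute, already projected into the patch feature space).
    Returns (pooled_feature, weights).
    """
    d = len(attribute_query)
    # Assumption: scaled dot product as the correlation measure.
    scores = [dot(p, attribute_query) / math.sqrt(d) for p in patch_features]
    # Assumption: softmax turns correlations into importance weights.
    weights = softmax(scores)
    # Weighted sum of patch features, one output value per dimension.
    pooled = [
        sum(w * p[j] for w, p in zip(weights, patch_features))
        for j in range(d)
    ]
    return pooled, weights


# Toy example: three 2-dimensional patch features and one attribute query.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
pooled, weights = attribute_attention(patches, query)
```

In this toy input, the first and third patches align with the query equally and therefore receive equal weights, both larger than the weight of the second patch, which is orthogonal to the query. In a full model the pooled feature would presumably be combined with the CLS feature, but how that is done is not stated in the abstract.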