Generalized Zero-Shot Learning (GZSL) aims to recognize images from both seen and unseen categories. Most GZSL methods learn to synthesize CNN visual features for the unseen classes by leveraging the entire semantic information, e.g., tags and attributes, together with the visual features of the seen classes. Within the visual features, we define two types, semantic-consistent and semantic-unrelated, which represent the characteristics of images that are annotated in attributes and the less informative features of images, respectively. In principle, semantic-unrelated information cannot be transferred from seen classes to unseen classes through the semantic-visual relationship, as the corresponding characteristics are not annotated in the semantic information. Thus, the foundation of visual feature synthesis is not always solid, as the features of the seen classes may involve semantic-unrelated information that could interfere with the alignment between the semantic and visual modalities. To address this issue, we propose a novel feature disentangling approach based on an encoder-decoder architecture that factorizes the visual features of images into these two latent feature spaces and extracts the corresponding representations. Furthermore, a relation module is incorporated into this architecture to learn the semantic-visual relationship, whilst a total correlation penalty is applied to encourage the disentanglement of the two latent representations. The proposed model aims to distill high-quality semantic-consistent representations that capture the intrinsic features of seen images, which are further taken as the generation target for unseen classes. Extensive experiments conducted on seven GZSL benchmark datasets have verified the state-of-the-art performance of the proposed model.
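The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `DisentanglingSketch`, its linear layers, the dimensions, and the random weights are all hypothetical stand-ins, and the total correlation penalty and all training steps are omitted. It only shows how one visual feature vector is split into a semantic-consistent latent and a semantic-unrelated latent, reconstructed by a decoder, and scored against an attribute vector by a relation module.

```python
import math
import random


def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def reconstruction_loss(x, x_hat):
    """Mean squared error between a feature and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


class DisentanglingSketch:
    """Hypothetical single-layer encoder, decoder, and relation module."""

    def __init__(self, feat_dim, attr_dim, sem_dim, unrel_dim, seed=0):
        rng = random.Random(seed)
        self.sem_dim = sem_dim
        # Encoder: visual feature -> both latent spaces in one vector.
        self.enc = random_matrix(sem_dim + unrel_dim, feat_dim, rng)
        # Decoder: both latents -> reconstructed visual feature.
        self.dec = random_matrix(feat_dim, sem_dim + unrel_dim, rng)
        # Relation module: (semantic-consistent latent, attributes) -> score.
        self.rel = random_matrix(1, sem_dim + attr_dim, rng)

    def encode(self, x):
        h = linear(self.enc, x)
        # First block: semantic-consistent; second block: semantic-unrelated.
        return h[:self.sem_dim], h[self.sem_dim:]

    def decode(self, h_sem, h_unrel):
        return linear(self.dec, h_sem + h_unrel)

    def relation(self, h_sem, attrs):
        # Compatibility score in (0, 1) for a latent/attribute pair.
        return sigmoid(linear(self.rel, h_sem + attrs)[0])


model = DisentanglingSketch(feat_dim=8, attr_dim=4, sem_dim=3, unrel_dim=2)
feature = [0.5, -1.0, 0.25, 0.0, 1.5, -0.5, 0.75, 1.0]  # made-up visual feature
attributes = [1.0, 0.0, 0.0, 1.0]                        # made-up class attributes

h_sem, h_unrel = model.encode(feature)
reconstructed = model.decode(h_sem, h_unrel)
score = model.relation(h_sem, attributes)
loss = reconstruction_loss(feature, reconstructed)
```

In this sketch only `h_sem` is passed to the relation module, which mirrors the abstract's point that only the semantic-consistent representation should be aligned with the semantic information; `h_unrel` contributes to reconstruction alone.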