Hierarchical classification (HC) assigns each object multiple labels organized into a hierarchical structure. Existing deep-learning-based HC methods usually classify an instance by descending from the root node until a leaf node is reached. In the real world, however, images degraded by noise, occlusion, blur, or low resolution may not provide sufficient information for classification at subordinate levels. To address this issue, we propose a novel semantic guided level-category hybrid prediction network (SGLCHPN) that jointly performs level and category prediction in an end-to-end manner. SGLCHPN comprises two modules: a visual transformer that extracts feature vectors from the input images, and a semantic guided cross-attention module that uses category word embeddings as queries to guide the learning of category-specific representations. To evaluate the proposed method, we construct two new datasets in which images span a broad range of quality and are therefore labeled to different levels (depths) in the hierarchy according to their individual quality. Experimental results demonstrate the effectiveness of the proposed HC method.
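The idea of using category word embeddings as queries can be illustrated with a minimal sketch of single-head cross-attention. This is not the SGLCHPN implementation: it omits learned projection matrices, multiple heads, and the transformer backbone, and it assumes that word embeddings and patch features already share the same dimension. The function name `semantic_cross_attention` and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def semantic_cross_attention(category_embeddings, patch_features):
    """Single-head cross-attention without learned projections (assumption).

    category_embeddings: one query vector per category (C x d).
    patch_features: one key/value vector per image patch (N x d).
    Returns a pair: C x d category-specific representations and
    C x N attention weights.
    """
    d = len(patch_features[0])
    scale = 1.0 / math.sqrt(d)
    representations, weights = [], []
    for query in category_embeddings:
        # Each category query scores every patch, then pools the patches.
        attn = softmax([dot(query, key) * scale for key in patch_features])
        pooled = [
            sum(w * feat[j] for w, feat in zip(attn, patch_features))
            for j in range(d)
        ]
        representations.append(pooled)
        weights.append(attn)
    return representations, weights


# Toy data (hypothetical): two categories, three patches, dimension 2.
category_embeddings = [[1.0, 0.0], [0.0, 1.0]]
patch_features = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
reps, attn = semantic_cross_attention(category_embeddings, patch_features)
```

In this toy setup each category attends most strongly to the patch aligned with its own embedding, so every category receives a different pooled representation of the same image. A per-category classifier could then operate on these representations, although how SGLCHPN combines them with level prediction is not specified in the abstract.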