Multi-label image classification is the task of predicting a set of labels corresponding to the objects, attributes, or other entities present in an image. In this work we propose the Classification Transformer (C-Tran), a general framework for multi-label image classification that leverages Transformers to exploit the complex dependencies among visual features and labels. Our approach consists of a Transformer encoder trained to predict a set of target labels given an input set of masked labels and visual features from a convolutional neural network. A key ingredient of our method is a label mask training objective that uses a ternary encoding scheme to represent the state of each label as positive, negative, or unknown during training. Our model shows state-of-the-art performance on challenging datasets such as COCO and Visual Genome. Moreover, because our model explicitly represents the uncertainty of labels during training, it is also more general: it can produce improved results for images that come with partial or extra label annotations at inference time. We demonstrate this additional capability on the COCO, Visual Genome, News500, and CUB image datasets.
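To make the ternary encoding concrete, the following is a minimal sketch of how label states could be prepared for such a model. It is not the authors' implementation: the numeric state codes, the helper names `mask_labels` and `partial_states`, and the choice of how many labels to hide are all assumptions made for illustration. The abstract only specifies that each label is in one of three states (positive, negative, or unknown).

```python
import random

# Illustrative state codes (an assumption; the source only names the three states).
UNKNOWN, NEGATIVE, POSITIVE = 0, -1, 1


def mask_labels(labels, num_unknown, rng=None):
    """Hide `num_unknown` randomly chosen labels for one training sample.

    labels: list of 0/1 ground-truth values, one per class.
    Returns (states, hidden), where `states` holds one ternary code per
    class and `hidden` lists the indices the model would be asked to predict.
    """
    if not 0 <= num_unknown <= len(labels):
        raise ValueError("num_unknown must be between 0 and len(labels)")
    rng = rng or random.Random()
    hidden = set(rng.sample(range(len(labels)), num_unknown))
    states = [
        UNKNOWN if i in hidden else (POSITIVE if y else NEGATIVE)
        for i, y in enumerate(labels)
    ]
    return states, sorted(hidden)


def partial_states(num_labels, known):
    """Build inference-time states from partial annotations.

    known: dict mapping class index -> bool for the labels that are given;
    every other class is marked UNKNOWN.
    """
    states = [UNKNOWN] * num_labels
    for idx, is_present in known.items():
        states[idx] = POSITIVE if is_present else NEGATIVE
    return states
```

During training, a call such as `mask_labels([1, 0, 0, 1, 0], num_unknown=2)` would hide two of the five labels and expose the rest as known positives or negatives. At inference, `partial_states(4, {1: True, 3: False})` returns `[0, 1, 0, -1]`, which marks class 1 as present, class 3 as absent, and leaves the remaining classes for the model to predict.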