Multi-label image classification is the task of predicting a set of class labels, which can be regarded as orderless sequential data. Transformers process sequential data as a whole and are therefore inherently well suited to set prediction. The first vision-based transformer model, proposed for the object detection task, introduced the concept of object queries. Object queries are learnable positional encodings used by the attention modules in decoder layers to decode object classes or bounding boxes from regions of interest in an image. However, feeding the same set of object queries to every decoder layer hinders training: it lowers performance and delays convergence. In this paper, we propose primal object queries, which are provided only at the start of the transformer decoder stack. In addition, we improve the mixup technique proposed for multi-label classification. The proposed transformer model with primal object queries improves the state-of-the-art class-wise F1 metric by 2.1% and 1.8%, and speeds up convergence by 79.0% and 38.6%, on the MS-COCO and NUS-WIDE datasets respectively.
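The distinction between standard object queries and primal object queries can be illustrated with a minimal sketch. This is a toy numpy model, not the paper's implementation: single-head attention, no layer normalization or feed-forward sublayers, and all names (`decode`, `cross_attention`, the `primal` flag) are hypothetical. It shows only the structural difference: with `primal=True` the learnable queries enter the first decoder layer only, while with `primal=False` the same queries are re-injected at every layer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, features):
    # Scaled dot-product attention: object queries attend to image features.
    d = queries.shape[-1]
    attn = softmax(queries @ features.T / np.sqrt(d))
    return attn @ features

def decode(features, object_queries, num_layers=3, primal=True):
    # Toy decoder stack (hypothetical sketch).
    # primal=True : queries enter only at the start of the stack.
    # primal=False: the same queries are re-added at every layer.
    x = object_queries
    for _ in range(num_layers):
        if not primal:
            x = x + object_queries            # re-inject queries each layer
        x = x + cross_attention(x, features)  # residual cross-attention
    return x

rng = np.random.default_rng(0)
features = rng.standard_normal((49, 32))   # e.g. a 7x7 feature map, 32-dim
queries = rng.standard_normal((10, 32))    # 10 learnable object queries
out = decode(features, queries, primal=True)
print(out.shape)  # (10, 32): one decoded embedding per object query
```

Each of the 10 decoded embeddings would then be projected to class logits; in the primal variant the decoder layers are free to transform the query representations without the repeated additive input that the paper identifies as harmful to training.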