This paper presents a simple and effective approach to multi-label classification. The proposed approach uses Transformer decoders to query the existence of each class label. The choice of Transformer is motivated by the need to extract local discriminative features adaptively for different labels, a strongly desired property because one image can contain multiple objects. The built-in cross-attention module in the Transformer decoder offers an effective way to use label embeddings as queries that probe and pool class-related features from a feature map computed by a vision backbone; the pooled features are then used for per-label binary classification. Compared with prior works, the new framework is simple, using standard Transformers and vision backbones, and effective, consistently outperforming all previous works on five multi-label classification data sets, including MS-COCO, PASCAL VOC, NUS-WIDE, and Visual Genome. In particular, we achieve $91.3\%$ mAP on MS-COCO. We hope its compact structure, simple implementation, and superior performance will serve as a strong baseline for multi-label classification tasks and future studies. The code will be available soon at https://github.com/SlongLiu/query2labels.
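To make the querying idea concrete, the following is a minimal sketch of label embeddings acting as cross-attention queries over a backbone feature map, followed by one binary classifier per label. It is an illustration under stated assumptions, not the released implementation: it uses a single attention head with no learned projections, self-attention, feed-forward layers, or normalization, all of which a standard Transformer decoder would include. The function names `cross_attention_pool` and `classify`, the toy dimensions, and the random inputs are hypothetical.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_pool(label_queries, features):
    """Pool one feature vector per label.

    label_queries: K label embeddings, each of dimension d (the queries).
    features: N spatial feature vectors of dimension d (keys and values),
              i.e. a flattened H*W backbone feature map.
    Simplifying assumption: keys and values are the raw features.
    """
    d = len(features[0])
    pooled, attn_maps = [], []
    for q in label_queries:
        # Scaled dot-product attention of this label over all positions.
        weights = softmax([dot(q, f) / math.sqrt(d) for f in features])
        # Attention-weighted average of the spatial features.
        pooled.append(
            [sum(w * f[j] for w, f in zip(weights, features)) for j in range(d)]
        )
        attn_maps.append(weights)
    return pooled, attn_maps


def classify(pooled, class_weights, class_biases):
    """Independent sigmoid classifier per label on its pooled feature."""
    return [
        1.0 / (1.0 + math.exp(-(dot(w, p) + b)))
        for p, w, b in zip(pooled, class_weights, class_biases)
    ]


# Toy example: K=3 labels, N=6 spatial positions, d=4 channels.
rng = random.Random(0)
K, N, d = 3, 6, 4
label_queries = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(K)]
features = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]
class_weights = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(K)]
class_biases = [0.0] * K

pooled, attn_maps = cross_attention_pool(label_queries, features)
probs = classify(pooled, class_weights, class_biases)
print(probs)  # one existence probability per label
```

Because each label has its own query, each label attends to its own spatial positions, which is what allows different labels to pool features from different objects in the same image.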