Real-world recognition systems often encounter the challenge of unseen labels. To identify such unseen labels, multi-label zero-shot learning (ML-ZSL) focuses on transferring knowledge via a pre-trained textual label embedding (e.g., GloVe). However, such methods only exploit single-modal knowledge from a language model, ignoring the rich semantic information inherent in image-text pairs. In contrast, recently developed open-vocabulary (OV) methods successfully exploit such image-text pair information in object detection and achieve impressive performance. Inspired by the success of OV-based methods, we propose a novel open-vocabulary framework, named multi-modal knowledge transfer (MKT), for multi-label classification. Specifically, our method exploits the multi-modal knowledge of image-text pairs based on a vision and language pre-training (VLP) model. To facilitate transferring the image-text matching ability of the VLP model, knowledge distillation is employed to guarantee the consistency of image and label embeddings, along with prompt tuning to further update the label embeddings. To enable the recognition of multiple objects, a simple but effective two-stream module is developed to capture both local and global features. Extensive experimental results show that our method significantly outperforms state-of-the-art methods on public benchmark datasets. The source code is available at https://github.com/sunanhe/MKT.
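As a rough illustration of two of the mechanisms named above, the sketch below shows (a) a two-stream head that scores labels from both a globally pooled image embedding and per-patch local embeddings, and (b) a distillation loss that keeps the trainable image embedding close to a frozen VLP teacher's. This is a minimal PyTorch sketch under assumed shapes, not the authors' implementation; all names (`TwoStreamHead`, `distill_loss`, `feat_dim`, `embed_dim`) are hypothetical, and prompt tuning of the label embeddings is omitted.

```python
# Minimal, hypothetical sketch of the abstract's ideas; not the MKT codebase.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamHead(nn.Module):
    """Two-stream module: the global stream pools the whole feature map, while
    the local stream scores each spatial location and max-pools over locations,
    so multiple co-occurring objects can each respond on their own region."""
    def __init__(self, feat_dim: int, embed_dim: int):
        super().__init__()
        self.global_proj = nn.Linear(feat_dim, embed_dim)
        self.local_proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, feat_map: torch.Tensor, label_emb: torch.Tensor):
        # feat_map: (B, HW, feat_dim) patch features; label_emb: (C, embed_dim)
        labels = F.normalize(label_emb, dim=-1).t()                    # (D, C)
        global_emb = F.normalize(self.global_proj(feat_map.mean(dim=1)), dim=-1)
        local_emb = F.normalize(self.local_proj(feat_map), dim=-1)     # (B, HW, D)
        global_logits = global_emb @ labels                            # (B, C)
        local_logits = (local_emb @ labels).max(dim=1).values          # (B, C)
        return (global_logits + local_logits) / 2

def distill_loss(student_img_emb: torch.Tensor, teacher_img_emb: torch.Tensor):
    """Knowledge distillation: pull the student's image embedding toward the
    frozen VLP teacher's, preserving its image-text matching ability."""
    return 1.0 - F.cosine_similarity(student_img_emb, teacher_img_emb, dim=-1).mean()

# Hypothetical usage with dummy shapes: 4 images, 49 patches, 768-d features,
# and 80 label embeddings of dimension 512 from the frozen VLP text encoder.
head = TwoStreamHead(feat_dim=768, embed_dim=512)
logits = head(torch.randn(4, 49, 768), torch.randn(80, 512))  # (4, 80)
```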