Real-world recognition systems often encounter numerous unseen labels in practice. To identify such unseen labels, multi-label zero-shot learning (ML-ZSL) focuses on transferring knowledge via pre-trained textual label embeddings (e.g., GloVe). However, such methods exploit only single-modal knowledge from a language model, ignoring the rich semantic information inherent in image-text pairs. In contrast, recently developed open-vocabulary (OV) methods successfully exploit the information in image-text pairs for object detection and achieve impressive performance. Inspired by the success of OV-based methods, we propose a novel open-vocabulary framework, named multi-modal knowledge transfer (MKT), for multi-label classification. Specifically, our method exploits the multi-modal knowledge of image-text pairs based on a vision and language pre-training (VLP) model. To facilitate transferring the image-text matching ability of the VLP model, knowledge distillation is used to guarantee the consistency of image and label embeddings, along with prompt tuning to further update the label embeddings. To better recognize multiple objects, a simple but effective two-stream module is developed to capture both local and global features. Extensive experimental results show that our method significantly outperforms state-of-the-art methods on public benchmark datasets. Code will be available at https://github.com/seanhe97/MKT.
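To make the components named above concrete, here is a minimal, illustrative PyTorch sketch of the three ingredients: distillation from a frozen VLP image encoder to keep image and label embeddings consistent, learnable prompt tokens that update the label embeddings, and a two-stream head that scores labels from both global and local features. All module names, shapes, and the specific prompt-tuning and pooling designs are assumptions for illustration, not the authors' released implementation.

```python
# Illustrative sketch only; module names and design choices are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MKTSketch(nn.Module):
    def __init__(self, vision_encoder, frozen_vlp_image_encoder, label_embeds,
                 embed_dim=512, n_prompt=4):
        super().__init__()
        self.vision = vision_encoder              # trainable student backbone
        self.teacher = frozen_vlp_image_encoder   # frozen VLP (CLIP-like) teacher
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        # Label embeddings initialized from the VLP text encoder (kept fixed);
        # learnable prompt tokens adapt them (a simple prompt-tuning variant).
        self.register_buffer("label_embeds", label_embeds)   # (C, D)
        self.prompt = nn.Parameter(torch.zeros(n_prompt, embed_dim))
        self.prompt_proj = nn.Linear(embed_dim, embed_dim)

    def tuned_labels(self):
        # Shift each label embedding by the mean of the projected prompt
        # tokens; one of several possible prompt-tuning designs.
        shift = self.prompt_proj(self.prompt).mean(dim=0)
        return F.normalize(self.label_embeds + shift, dim=-1)

    def forward(self, images):
        feats = self.vision(images)                   # (B, N, D) patch tokens
        g = F.normalize(feats.mean(dim=1), dim=-1)    # global stream: pooled
        l = F.normalize(feats, dim=-1)                # local stream: patches
        labels = self.tuned_labels()                  # (C, D)
        # Two-stream scoring: global similarity plus max-pooled local scores.
        global_logits = g @ labels.t()                        # (B, C)
        local_logits = (l @ labels.t()).max(dim=1).values     # (B, C)
        logits = global_logits + local_logits
        # Distillation: align the student's global embedding with the frozen
        # teacher's image embedding (cosine distance).
        with torch.no_grad():
            t = F.normalize(self.teacher(images), dim=-1)     # (B, D)
        distill_loss = (1 - (g * t).sum(dim=-1)).mean()
        return logits, distill_loss
```

In training, `logits` would feed a standard multi-label loss (e.g., binary cross-entropy) while `distill_loss` is added as a weighted regularizer; the weighting and the exact pooling for the local stream are design choices not specified here.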