We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection, which are complementary to each other by definition. Most previous works on multi-modal FSOD are fine-tuning-based, which is inefficient for online applications. Moreover, these methods usually require human prior knowledge such as class names to extract class semantic embeddings, which can be hard to obtain for rare classes. Our approach is motivated by the high-level conceptual similarity between (metric-based) meta-learning and prompt-based learning, which learn generalizable few-shot and zero-shot object detection models, respectively, without fine-tuning. Specifically, we combine the few-shot visual classifier learned via meta-learning and the text classifier learned via prompt-based learning to build the multi-modal classifier and detection models. In addition, to fully exploit pre-trained language models, we propose meta-learning-based cross-modal prompting to generate soft prompts for novel classes present in the few-shot visual examples, which are then used to learn the text classifier. Knowledge distillation is introduced to learn the soft prompt generator without using human prior knowledge of class names, which may not be available for rare classes. Our insight is that the few-shot support images naturally include related context information and semantics of the class. We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
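To make the classifier-combination idea concrete, below is a minimal illustrative sketch (not the paper's implementation): a metric-based visual classifier built from averaged support-image prototypes is fused with a text classifier built from class embeddings produced by a frozen text encoder fed with (soft) prompts. The tensor shapes, the cosine-similarity scoring, and the fusion weight `alpha` are assumptions made only for exposition.

```python
# Hedged sketch of a multi-modal (visual + text) classifier for FSOD proposals.
# Shapes, cosine similarity, and the fusion weight `alpha` are illustrative assumptions.
import torch
import torch.nn.functional as F


def visual_prototypes(support_feats: torch.Tensor) -> torch.Tensor:
    """Average K-shot support features into one prototype per class.
    support_feats: (num_classes, K, dim) -> (num_classes, dim)."""
    return F.normalize(support_feats.mean(dim=1), dim=-1)


def multimodal_scores(query_feats: torch.Tensor,
                      support_feats: torch.Tensor,
                      text_embeds: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    """Fuse the few-shot visual classifier and the text classifier by a
    weighted sum of cosine similarities.
    query_feats: (num_boxes, dim) region features from the detector,
    text_embeds: (num_classes, dim) class embeddings from prompts fed to a
    frozen text encoder."""
    q = F.normalize(query_feats, dim=-1)
    protos = visual_prototypes(support_feats)   # (C, dim)
    t = F.normalize(text_embeds, dim=-1)        # (C, dim)
    vis_logits = q @ protos.t()                 # (num_boxes, C) visual scores
    txt_logits = q @ t.t()                      # (num_boxes, C) text scores
    return alpha * vis_logits + (1 - alpha) * txt_logits


# Toy usage with random tensors standing in for detector / encoder outputs.
if __name__ == "__main__":
    C, K, D, B = 5, 3, 256, 10   # classes, shots, feature dim, proposals
    scores = multimodal_scores(torch.randn(B, D),
                               torch.randn(C, K, D),
                               torch.randn(C, D))
    print(scores.shape)  # torch.Size([10, 5])
```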