Few-shot object detection aims to detect novel objects with only a few annotated examples. Prior works have shown meta-learning to be a promising solution, and most of them address detection by meta-learning over region proposals for classification and location refinement. However, these methods rely heavily on initially well-located region proposals, which are usually hard to obtain under few-shot settings. This paper presents a novel meta-detector framework, namely Meta-DETR, which eliminates region-wise prediction and instead meta-learns object localization and classification at the image level in a unified and complementary manner. Specifically, it first encodes both support and query images into category-specific features and then feeds them into a category-agnostic decoder to directly generate predictions for specific categories. To facilitate meta-learning with deep networks, we design a simple but effective Semantic Alignment Mechanism (SAM), which aligns high-level and low-level feature semantics to improve the generalization of meta-learned representations. Experiments over multiple few-shot object detection benchmarks show that Meta-DETR outperforms state-of-the-art methods by large margins.