Few-shot object detection (FSOD), which aims to detect novel objects from very few training examples, has recently attracted great research interest in the community. Metric-learning based methods have been shown to be effective for this task: they use a two-branch siamese network to compute the similarity between image regions and few-shot examples for detection. However, in previous works the interaction between the two branches is restricted to the detection head, leaving the remaining hundreds of layers to extract features separately. Inspired by recent work on vision transformers and vision-language transformers, we propose a novel Fully Cross-Transformer based model (FCT) for FSOD that incorporates cross-transformer into both the feature backbone and the detection head. We further propose asymmetric-batched cross-attention to aggregate key information from the two branches, which have different batch sizes. By introducing these multi-level interactions, our model improves few-shot similarity learning between the two branches. Comprehensive experiments on both the PASCAL VOC and MSCOCO FSOD benchmarks demonstrate the effectiveness of our model.
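The core idea of asymmetric-batched cross-attention is that the query branch and the support branch carry different batch sizes, so one branch's tokens must attend to a key/value set pooled across the other branch's whole batch. The following is a minimal NumPy sketch of that idea, not the paper's actual FCT implementation; the function name, shapes, and the simple flatten-and-pool strategy are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_batched_cross_attention(q_feats, s_feats):
    """Hypothetical sketch: tokens from the query branch attend to
    key/value tokens pooled from a support branch with a *different*
    batch size.

    q_feats: (B_q, N_q, D) -- query-image tokens
    s_feats: (B_s, N_s, D) -- support (few-shot example) tokens
    returns: (B_q, N_q, D) -- support-conditioned query features
    """
    B_q, N_q, D = q_feats.shape
    # Flatten all support tokens into one shared key/value set, so every
    # query image can attend to every few-shot example regardless of the
    # batch-size mismatch between the two branches.
    kv = s_feats.reshape(-1, D)                       # (B_s * N_s, D)
    attn = softmax(q_feats @ kv.T / np.sqrt(D))       # (B_q, N_q, B_s*N_s)
    return attn @ kv                                  # (B_q, N_q, D)
```

For example, with 2 query images of 5 tokens each and 3 support examples of 4 tokens each, the output keeps the query-branch shape `(2, 5, D)` while mixing in information from all 12 support tokens.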