In this paper, we tackle the challenging problem of Few-shot Object Detection (FSOD). Existing FSOD pipelines (i) rely on average-pooled representations, which lose information; and/or (ii) discard position information that can help detect object instances. Consequently, such pipelines are sensitive to large intra-class appearance and geometric variations between support and query images. To address these drawbacks, we propose a Time-rEversed diffusioN tEnsor Transformer (TENET), which (i) forms high-order tensor representations that capture highly discriminative multi-way feature occurrences, and (ii) uses a transformer that dynamically extracts correlations between the query image and the entire support set, instead of a single average-pooled support embedding. We also propose a Transformer Relation Head (TRH), equipped with higher-order representations, which encodes correlations between query regions and the entire support set while remaining sensitive to the positional variability of object instances. Our model achieves state-of-the-art results on the PASCAL VOC, FSOD, and COCO datasets.
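To illustrate why average pooling can lose information that a higher-order representation retains, the following is a minimal sketch and not the TENET formulation itself. It assumes plain second-order pooling (the mean of outer products of feature vectors) as a stand-in for the paper's high-order tensor representations; the function names `average_pool` and `second_order_pool` and the toy arrays are hypothetical.

```python
import numpy as np

def average_pool(features):
    """First-order pooling: mean over N feature vectors, (N, d) -> (d,)."""
    return features.mean(axis=0)

def second_order_pool(features):
    """Second-order pooling: mean of outer products, (N, d) -> (d, d).

    This is a generic stand-in (an assumption), not the paper's operator.
    """
    n = features.shape[0]
    return features.T @ features / n

# Two toy feature sets with the same mean but different co-occurrence structure.
a = np.array([[1.0, 0.0], [0.0, 1.0]])  # channels never fire together
b = np.array([[0.5, 0.5], [0.5, 0.5]])  # channels always fire together

# Average pooling cannot tell the two sets apart ...
same_mean = np.allclose(average_pool(a), average_pool(b))

# ... whereas the second-order statistics differ.
same_second_order = np.allclose(second_order_pool(a), second_order_pool(b))

print("identical under average pooling:", same_mean)
print("identical under second-order pooling:", same_second_order)
```

In this toy case the two sets collapse to the same average-pooled vector, while their second-order matrices differ in the off-diagonal (co-occurrence) entries, which is the kind of multi-way feature information the abstract refers to.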